Abstract

This paper studies posterior concentration behavior of the base probability measure of a Dirichlet process $\mathscr{D}$, given observations associated with $m$ random measures sampled from the Dirichlet process, as $m$ and the number of observations $m \times n$ tend to infinity. The base measure itself may be endowed with a prior, (another) Dirichlet process, a construction popularly known as the hierarchical Dirichlet process [Teh et al., 2006]. The random measures sampled from Dirichlet process $\mathscr{D}$ serve as mixing measures for an exchangeable collection of $m$ mixture distributions, posterior concentration behavior of which is also investigated. Convergence rates are established in transportation distances (i.e. Wasserstein metrics), under the assumption that the true base measure has a finite but unknown number of support points in $\mathbb{R}^d$. This theory quantifies the benefit of "borrowing strength" in the inference of groups of data with small sample size, a heuristic argument commonly used to motivate hierarchical modeling. In certain settings, the gain in efficiency can be dramatic, improving from a standard nonparametric rate to a parametric rate of convergence. Tools developed include transportation distances for nonparametric Bayesian hierarchies, the existence of tests for Dirichlet processes, and concentration properties of Dirichlet measures.






1 Introduction 

Ferguson's Dirichlet process is a basic building block in nonparametric Bayesian statistics [Ferguson, 1973, Blackwell and MacQueen, 1973, Sethuraman, 1994]. Traditionally used as a prior distribution for mixing measures in mixture models, the Dirichlet process is now, following recent advances in modeling and computation, routinely built into hierarchical and infinite-dimensional structures in innovative ways [Hjort et al., 2010]. A well-known example is the hierarchical Dirichlet process [Teh et al., 2006], where the Dirichlet base measure itself becomes an object of inference, which is endowed with yet another Dirichlet process prior. Hierarchical Bayesian models such as this have been applied successfully in many fields, such as computational biology, computer vision and natural language processing, partly due to the desire to infer about quantities that can only be represented as latent variables in applications, and partly due to improved computational methods such as Markov Chain Monte Carlo that make such inferences possible. See Teh and Jordan [2010] for a nice exposition and many examples.

This paper investigates asymptotic behavior of measure-valued latent variables that arise in hierarchical models such as the hierarchical Dirichlet process, reposing on the general framework of posterior asymptotics developed in the past decades [Ghosal et al., 2000, Shen and Wasserman, 2001] (see also [Barron et al., 1999, Walker, 2004, Ghosal and van der Vaart, 2007, Walker et al., 2007]). While these works approach posterior asymptotics from the viewpoint of density estimation, we shall focus directly on the posterior of the latent variables of interest. Several examples of such asymptotic theory have been developed recently for mixture models [Rousseau and Mengersen, 2011, Nguyen, 2012a].

An appealing aspect of hierarchical modeling, well appreciated by modelers and practitioners, is the notion of "borrowing strength". Latent variables shared further up in a probabilistic hierarchy provide an infrastructure through which one may improve the inference of a parameter of interest by borrowing from information on related data and parameters also included in the model. Latent hierarchies are a Bayesian modeling device for frequentist concepts of shrinkage and random effects (cf. Chapter 5 of Lehmann and Casella [1998]). Due to their wide usage, it is of interest to characterize the roles of latent hierarchies and their effects on posterior inference in a rigorous manner. Theoretical work addressing such questions, particularly for nonparametric and hierarchical models, remains very scarce in the literature.



Model setting Our choice of studying the hierarchical Dirichlet process also stems from 
the simplicity with which the Dirichlet process can be used recursively to construct hierar- 
chies of random probability measures. The main question that the theory comes down to is 
the convergence behavior of the base measure of a Dirichlet process given the observations 
associated with sampled measures that are generated from the Dirichlet process. 

Let $\Theta$ be a complete separable metric space equipped with the Borel sigma algebra, $\mathscr{P}(\Theta)$ the space of probability measures on $\Theta$, and let $G \in \mathscr{P}(\Theta)$. Let $Q_1, \ldots, Q_m$ be i.i.d. draws from a Dirichlet process $\mathscr{D}_{\alpha G}$, where concentration parameter $\alpha > 0$ is given and the base measure $G = G_0$ is unknown. We are interested in making inferences about $G_0$ based on observations associated with the sampled measures $Q_1, \ldots, Q_m$. Taking a Bayesian approach, the base measure $G$ is assumed random and endowed with a prior distribution $\Pi_G$:

$$G \sim \Pi_G, \qquad Q_1, \ldots, Q_m \mid G \overset{iid}{\sim} \mathscr{D}_{\alpha G}. \tag{1}$$

For the choice of prior $\Pi_G := \mathscr{D}_{\gamma H}$, where $\gamma > 0$ and $H \in \mathscr{P}(\Theta)$ is non-atomic and known, this construction is called the hierarchical Dirichlet process [Teh et al., 2006]. Since $Q_1, \ldots, Q_m$ are discrete with probability one, they typically serve as mixing measures for a collection of exchangeable mixture distributions. We restrict ourselves to convolution mixture distributions, which admit density of the form $p_{Q_i}(x) := Q_i * f(x) = \int f(x - \theta) Q_i(d\theta)$, where $f$ is a known kernel density function defined with respect to a dominating measure on $\Theta$. For each $i = 1, \ldots, m$, let $Y^i_{[n]} = (Y_{i1}, \ldots, Y_{in})$ be an observed iid $n$-sample:

$$Y_{i1}, \ldots, Y_{in} \mid Q_i \overset{iid}{\sim} Q_i * f. \tag{2}$$

In sum, the $m$ samples $Y^1_{[n]}, \ldots, Y^m_{[n]}$ are drawn from a collection of $m$ exchangeable mixture distributions, which are linked through the shared and measure-valued random parameter $G$. Since both the $Q_i$'s and $G$ are latent variables in this hierarchical specification, two main questions we aim to address are:

(I) concentration behavior of the posterior distribution of the Dirichlet base measure $G$ given the data, and

(II) concentration behavior of the posterior of a mixture distribution, denoted by $Q * f$, as $Q$ is attached to the Bayesian hierarchy in the same way as the $Q_i$'s, in comparison to a "stand-alone" mixture model $Q * f$, where $Q$ is endowed with an independent prior distribution.
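To make the two-level sampling scheme concrete, the following is a minimal simulation sketch, not part of the original development. It assumes a true base measure $G_0$ with finitely many atoms on the real line, in which case a draw $Q_i$ from the Dirichlet process with base measure $\alpha G_0$ reduces to Dirichlet-distributed weights on those same atoms. The Gaussian kernel, the function names and all parameter values are illustrative assumptions.

```python
import random


def sample_dirichlet(shapes, rng):
    """Draw Dirichlet(shapes) weights by normalizing independent Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in shapes]
    total = sum(draws)
    return [g / total for g in draws]


def simulate_groups(atoms, base_weights, alpha, m, n, noise_sd=1.0, seed=0):
    """Simulate m groups of n observations from the two-level hierarchy.

    Assumption: G0 = sum_k base_weights[k] * delta_{atoms[k]} is finitely
    supported, so each Q_i ~ DP(alpha * G0) is a Dirichlet weight vector on the
    shared atoms. The kernel f is taken to be Gaussian (illustrative choice).
    """
    rng = random.Random(seed)
    shapes = [alpha * w for w in base_weights]
    mixing_weights, data = [], []
    for _ in range(m):
        q_i = sample_dirichlet(shapes, rng)  # weights of Q_i on the shared atoms
        # Each observation: draw theta ~ Q_i, then add kernel noise (Y ~ Q_i * f).
        y_i = [rng.choices(atoms, weights=q_i)[0] + rng.gauss(0.0, noise_sd)
               for _ in range(n)]
        mixing_weights.append(q_i)
        data.append(y_i)
    return mixing_weights, data


mixing_weights, data = simulate_groups(
    atoms=[-2.0, 0.0, 3.0], base_weights=[0.2, 0.3, 0.5], alpha=1.0, m=5, n=10)
```

Every group shares the atoms of $G_0$ but carries its own random weights, which is the structure that questions (I) and (II) exploit.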



Overview of results Dropping the index $i$, $Y_{[n]} := (Y_1, \ldots, Y_n)$ denotes the generic i.i.d. random $n$-vector according to the generic mixture density $Q * f$, where $Q$ is sampled from Dirichlet process $\mathscr{D}_{\alpha G}$. The marginal density of $Y_{[n]}$ takes the form:

$$p_{Y_{[n]}|G}(Y_{[n]}) = \int \prod_{j=1}^{n} Q * f(Y_j) \, \mathscr{D}_{\alpha G}(dQ). \tag{3}$$

Given an $m \times n$ data set $Y^{[m]}_{[n]} := (Y^1_{[n]}, \ldots, Y^m_{[n]})$, the posterior distribution of $G$ given $Y^{[m]}_{[n]}$ takes the form, for any measurable $B \subset \mathscr{P}(\Theta)$:

$$\Pi_G\left(G \in B \mid Y^{[m]}_{[n]}\right) = \frac{\int_B \prod_{i=1}^{m} p_{Y_{[n]}|G}(Y^i_{[n]}) \, d\Pi_G(G)}{\int \prod_{i=1}^{m} p_{Y_{[n]}|G}(Y^i_{[n]}) \, d\Pi_G(G)}. \tag{4}$$

In order to characterize posterior concentration behavior for $G$, we adopt the Wasserstein distance metric (see also [Nguyen, 2012a]): for some $r \geq 1$,

$$W_r(G, G') = \inf_{\kappa \in \mathcal{T}(G, G')} \left[ \int \|\theta - \theta'\|^r \, d\kappa(\theta, \theta') \right]^{1/r}.$$

Here $\mathcal{T}(G, G')$ is the space of all joint distributions on $\Theta \times \Theta$ whose marginal distributions are $G$ and $G'$. Such a joint distribution $\kappa$ is also called a coupling between $G$ and $G'$. Our theory builds on this and a more general notion of transportation distances that we will define, which has its roots in optimal transportation [Villani, 2008].
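For intuition about the Wasserstein distance, it may help to compute it in the simplest setting. The sketch below is an illustration only and is restricted to finitely supported measures on the real line, where the monotone (quantile) coupling is known to be optimal for the cost $|\theta - \theta'|^r$ with $r \geq 1$. The function name and the tolerance constant are assumptions made for this example.

```python
def wasserstein_1d(atoms_g, weights_g, atoms_h, weights_h, r=1.0):
    """W_r between two finitely supported probability measures on the real line.

    Uses the monotone coupling: atoms are sorted and mass is matched greedily
    from left to right, which is optimal in one dimension for r >= 1.
    """
    g = sorted(zip(atoms_g, weights_g))
    h = sorted(zip(atoms_h, weights_h))
    i = j = 0
    rem_g, rem_h = g[0][1], h[0][1]  # mass still to be shipped from / to the current atoms
    cost = 0.0
    while i < len(g) and j < len(h):
        moved = min(rem_g, rem_h)
        cost += moved * abs(g[i][0] - h[j][0]) ** r
        rem_g -= moved
        rem_h -= moved
        if rem_g <= 1e-15:  # current source atom exhausted
            i += 1
            rem_g = g[i][1] if i < len(g) else 0.0
        if rem_h <= 1e-15:  # current target atom filled
            j += 1
            rem_h = h[j][1] if j < len(h) else 0.0
    return cost ** (1.0 / r)


# Moving half of the mass from 1 to 0 costs 1/2 in W_1.
w1 = wasserstein_1d([0.0, 1.0], [0.5, 0.5], [0.0], [1.0], r=1.0)
```

In higher dimensions no such closed-form coupling is available in general, and the infimum over couplings has to be solved as a linear program.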

Our first main result (Theorem 1 in Section 2) establishes the following. Suppose that the $m \times n$ data set $Y^{[m]}_{[n]} := (Y^1_{[n]}, \ldots, Y^m_{[n]})$ is generated by the model specified above according to $G = G_0$ for some $G_0 \in \mathscr{P}(\Theta)$. As $m \to \infty$, and also $n \to \infty$ at a specified rate with respect to $m$, there is a net of scalars $\epsilon_{m,n} \downarrow 0$ such that

$$\Pi_G\left(W_1(G, G_0) \geq \epsilon_{m,n} \mid Y^{[m]}_{[n]}\right) \longrightarrow 0 \tag{5}$$

in $P_{Y^{[m]}_{[n]}|G_0}$-probability. An interesting aspect about this result is the interaction between two asymptotic quantities $m$ and $n$. They play asymmetric roles in the model hierarchy: $m$ is the number of groups of data, and $n$ is the sample size for each group. The desired interaction between $m$ and $n$ depends on the nature of the kernel density $f$, which in turn affects concentration rate $\epsilon_{m,n}$. For example, suppose that $\Theta$ is a bounded subset of $\mathbb{R}^d$, $\alpha \leq 1$, the true base measure $G_0$ is discrete with a finite (but unknown) number of support points, and $f$ is an ordinary smooth kernel with smoothness $\beta > 0$. Then for some positive constants $\gamma_1 < \gamma_2 < 1/3d$ depending only on $G_0, d, \beta$, we obtain concentration rate $\epsilon_{m,n} \asymp (n^{3d} \log m / m)^{1/(2d+2)}$, as $m, n$ tend to infinity under the constraint that $m^{\gamma_1} < n < (m/\log m)^{\gamma_2}$. The behavior of the posterior of $G$ for the full range of $n$ relative to $m$ remains open.
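The trade-off between the number of groups $m$ and the per-group sample size $n$ can be seen by evaluating the quoted rate numerically. The helper below is a hypothetical illustration that simply evaluates the expression $(n^{3d} \log m / m)^{1/(2d+2)}$, ignoring constants and the admissibility constraint on $n$. Values of the expression at or above one carry no information.

```python
import math


def base_measure_rate(m, n, d):
    """Evaluate (n^{3d} * log(m) / m)^{1/(2d+2)}, constants ignored."""
    return (n ** (3 * d) * math.log(m) / m) ** (1.0 / (2 * d + 2))


# With d = 1, the expression shrinks as m grows for fixed n, and grows with n for fixed m.
for m, n in [(10 ** 6, 10), (10 ** 8, 10), (10 ** 6, 50)]:
    print(m, n, round(base_measure_rate(m, n, d=1), 4))
```

The growth in $n$ for fixed $m$ reflects why the upper constraint on $n$ appears in the result.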

Our second main result establishes the effect of "borrowing strength" of hierarchical modeling. Suppose that an iid $\tilde{n}$-sample $Y^0_{[\tilde{n}]}$ drawn from a mixture model $Q_0 * f$ is available, where $Q_0 = Q_0^* \in \mathscr{P}(\Theta)$ is unknown:

$$Y^0_{[\tilde{n}]} \mid Q_0 \overset{iid}{\sim} Q_0 * f. \tag{6}$$

In a stand-alone setting $Q_0$ is endowed with a Dirichlet prior: $Q_0 \sim \mathscr{D}_{\alpha_0 H_0}$ for some known $\alpha_0 > 0$ and non-atomic base measure $H_0 \in \mathscr{P}(\Theta)$. Under mild conditions on the Dirichlet process mixture, Nguyen established the following concentration rate, stated in Hellinger metric [Nguyen, 2012a]:

$$\Pi_Q\left(h(Q_0 * f, Q_0^* * f) \geq (\log \tilde{n}/\tilde{n})^{1/(d+2)} \mid Y^0_{[\tilde{n}]}\right) \longrightarrow 0 \tag{7}$$

in $P_{Y^0_{[\tilde{n}]}|Q_0^*}$-probability. Alternatively, suppose that $Q_0$ is attached to the hierarchical Dirichlet process in the same way as the $Q_1, \ldots, Q_m$, i.e.:

$$G \sim \Pi_G, \qquad Q_0, Q_1, \ldots, Q_m \mid G \overset{iid}{\sim} \mathscr{D}_{\alpha G}. \tag{8}$$

Implicit in this specification, due to a standard property of the Dirichlet, is the assumption that $Q_0$ shares the same set of supporting atoms as $Q_1, \ldots, Q_m$, as they share with the (latent) discrete base measure $G$. In Theorem 2 it is shown that as $\tilde{n} \to \infty$ and $m, n \to \infty$ at suitable rates, there is $\delta_{m,n,\tilde{n}} \downarrow 0$ such that

$$\Pi_Q\left(h(Q_0 * f, Q_0^* * f) \geq \delta_{m,n,\tilde{n}} \mid Y^0_{[\tilde{n}]}, Y^{[m]}_{[n]}\right) \longrightarrow 0$$

in $P_{Y^0_{[\tilde{n}]}|Q_0^*} \times P_{Y^{[m]}_{[n]}|G_0}$-probability. In general,

$$\delta_{m,n,\tilde{n}} \asymp (\log \tilde{n}/\tilde{n})^{1/(d+2)} + \epsilon_{m,n}^{r_0/2} \log(1/\epsilon_{m,n}),$$

where $\epsilon_{m,n}$ is an assumed concentration rate for the posterior of $G$, and $r_0 \geq 1$ is a constant. Note the additional term $\epsilon_{m,n}^{r_0/2} \log(1/\epsilon_{m,n})$, which suggests decreased efficiency due to the maintenance of the latent hierarchy (i.e., the Dirichlet prior $\mathscr{D}_{\alpha G}$ is random due to random $G$).

However, if $m$ and $n$ grow sufficiently fast relative to $\tilde{n}$ so that $\epsilon_{m,n}$ is suitably small, then the impact of "borrowing of strength" from the $m \times n$ data set $Y^{[m]}_{[n]}$ on the inference about the data set $Y^0_{[\tilde{n}]}$ is quite striking. In particular, if $f$ is an ordinary smooth kernel density, we obtain $\delta_{m,n,\tilde{n}} \asymp (\log \tilde{n}/\tilde{n})^{1/2}$. If $f$ is a supersmooth kernel density with smoothness $\beta > 0$, then $\delta_{m,n,\tilde{n}} \asymp (1/\tilde{n})^{1/(\beta+2)}$. These present sharp improvements from the nonparametric rate $(\log \tilde{n}/\tilde{n})^{1/(d+2)}$ in Eq. (7). In short, the hierarchical Dirichlet process model is beneficial to groups of data with small sample size, while there is an overall loss of efficiency in order to maintain a global rate of concentration across all groups of data with comparable sample sizes.
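A rough numerical comparison gives a feel for the size of the improvement. The sketch below evaluates the three rate expressions discussed above, with all constants ignored. The function names and the chosen values of the small-group sample size, $d$ and $\beta$ are assumptions for illustration only.

```python
import math


def standalone_rate(n_tilde, d):
    """Stand-alone Dirichlet process mixture rate (log n / n)^{1/(d+2)}."""
    return (math.log(n_tilde) / n_tilde) ** (1.0 / (d + 2))


def hierarchical_rate_ordinary(n_tilde):
    """Rate (log n / n)^{1/2} quoted for ordinary smooth kernels."""
    return (math.log(n_tilde) / n_tilde) ** 0.5


def hierarchical_rate_supersmooth(n_tilde, beta):
    """Rate (1 / n)^{1/(beta+2)} quoted for supersmooth kernels."""
    return (1.0 / n_tilde) ** (1.0 / (beta + 2))


n_tilde, d, beta = 1000, 3, 2.0
print("stand-alone:", standalone_rate(n_tilde, d))
print("hierarchical, ordinary smooth:", hierarchical_rate_ordinary(n_tilde))
print("hierarchical, supersmooth:", hierarchical_rate_supersmooth(n_tilde, beta))
```

The gap between the stand-alone and hierarchical expressions widens with the dimension $d$, since only the stand-alone exponent depends on $d$.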

Finally, we note that these results and more in the sequel are only upper bound guaran- 
tees on the concentration of posterior distributions. A more complete understanding should 
involve obtaining lower bounds on concentration rates. Such a theory for hierarchical mod- 
els is currently not available. In the next section the main theorems will be formulated in 
full detail, followed by a discussion of the method of proof and the new technical tools 
introduced. 

Notations $W_r$ denotes the $L_r$ Wasserstein distance. $N(\epsilon, \mathcal{G}, W_r)$ denotes the covering number of $\mathcal{G}$ in metric $W_r$. $D(\epsilon, \mathcal{G}, W_r)$ is the packing number in the same metric. Several divergence measures for probability distributions are employed: $K(p, q)$, $h(p, q)$, $V(p, q)$ denote the Kullback-Leibler divergence, Hellinger and variational distance between two densities $p$ and $q$ defined with respect to a measure on a common space: $K(p, q) = \int p \log(p/q)$, $h^2(p, q) = \frac{1}{2} \int (\sqrt{p} - \sqrt{q})^2$ and $V(p, q) = \frac{1}{2} \int |p - q|$. In addition, we define $K_2(p, q) = \int p [\log(p/q)]^2$, $\chi(p, q) = \int p^2/q$.
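As a quick sanity check on these definitions, they can be evaluated for probability vectors on a finite set. This is a minimal sketch with hypothetical helper names, assuming every entry of $q$ is strictly positive wherever a ratio is taken.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence K(p, q) = sum p log(p / q)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def hellinger_sq(p, q):
    """Squared Hellinger distance h^2(p, q) = (1/2) sum (sqrt p - sqrt q)^2."""
    return 0.5 * sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))


def variational(p, q):
    """Variational distance V(p, q) = (1/2) sum |p - q|."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def kl_second_moment(p, q):
    """K_2(p, q) = sum p [log(p / q)]^2."""
    return sum(pi * math.log(pi / qi) ** 2 for pi, qi in zip(p, q) if pi > 0)


def chi(p, q):
    """chi(p, q) = sum p^2 / q."""
    return sum(pi * pi / qi for pi, qi in zip(p, q))


p, q = [0.5, 0.5], [0.9, 0.1]
summary = {"K": kl(p, q), "h2": hellinger_sq(p, q), "V": variational(p, q),
           "K2": kl_second_moment(p, q), "chi": chi(p, q)}
```

With these normalizations the standard orderings $h^2 \leq V$ and $h^2 \leq K/2$ hold, and $\chi(p, p) = 1$.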

2 Main theorems and technical tools 

Throughout the paper the following assumptions will be required: 
(A1) For some $r \geq 1$, $C_1 > 0$, $h(f(\cdot|\theta), f(\cdot|\theta')) \leq C_1 \|\theta - \theta'\|^r$ and $K(f(\cdot|\theta), f(\cdot|\theta')) \leq C_1 \|\theta - \theta'\|^r$ for all $\theta, \theta' \in \Theta$.

(A2) There holds $M = \sup_{\theta, \theta' \in \Theta} \chi(f(\cdot|\theta), f(\cdot|\theta')) < \infty$.

(A3) $H \in \mathscr{P}(\Theta)$ is non-atomic, and for some constant $c_0 > 0$, $H(B) \geq c_0 \epsilon^d$ for any closed ball $B$ of radius $\epsilon$.

The tools will be developed in a somewhat general setting of $\Theta$ (a complete and separable metric space). However, specific rates are obtained when $\Theta$ is restricted to be a bounded subset of $\mathbb{R}^d$, and $f$ is a density function on $\mathbb{R}^d$ that is symmetric around 0, i.e., $f(x|\theta) := f(x - \theta)$ such that $\int_B f(x) dx = \int_{-B} f(x) dx$ for any Borel set $B \subset \mathbb{R}^d$. In addition, the Fourier transform of $f$ satisfies $\tilde{f}(\omega) \neq 0$ for all $\omega \in \mathbb{R}^d$.

In the sequel, $f$ is of either one of two types: we say $f$ is ordinary smooth with parameter $\beta > 0$ if $|\tilde{f}(\omega) \prod_{j=1}^{d} |\omega_j|^{\beta}| \to d_0$ as $\omega_j \to \infty$ for $j = 1, \ldots, d$, for some positive constant $d_0$. Say $f$ is supersmooth with parameter $\beta > 0$ if $|\tilde{f}(\omega) \prod_{j=1}^{d} \exp(|\omega_j|^{\beta}/\gamma_0)| \to d_0$ as $\omega_j \to \infty$ for $j = 1, \ldots, d$, for some positive constants $d_0, \gamma_0$.
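Two familiar kernels illustrate the distinction in dimension one. The sketch below reads the two definitions as requiring that the Fourier transform times $|\omega|^\beta$ (ordinary smooth) or times $\exp(|\omega|^\beta/\gamma_0)$ (supersmooth) tends to a positive constant $d_0$. The choice of the standard Laplace and standard Gaussian kernels, and the helper names, are assumptions for illustration.

```python
import math


def laplace_cf(omega):
    """Fourier transform of the standard Laplace density exp(-|x|) / 2."""
    return 1.0 / (1.0 + omega ** 2)


def ordinary_smooth_ratio(omega, beta=2.0):
    """|f~(omega) * |omega|^beta| for the Laplace kernel; tends to d_0 = 1."""
    return abs(laplace_cf(omega) * abs(omega) ** beta)


def supersmooth_ratio(omega, beta=2.0, gamma0=2.0):
    """|f~(omega) * exp(|omega|^beta / gamma0)| for the standard Gaussian kernel.

    The Gaussian transform is exp(-omega^2 / 2); the product is formed in the
    exponent to avoid overflow for large omega.
    """
    return math.exp(-omega ** 2 / 2.0 + abs(omega) ** beta / gamma0)


for omega in (10.0, 100.0, 1000.0):
    print(omega, ordinary_smooth_ratio(omega), supersmooth_ratio(omega))
```

Under this reading the Laplace kernel is ordinary smooth with $\beta = 2$, and the Gaussian kernel is supersmooth with $\beta = 2$ and $\gamma_0 = 2$.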

2.1 Main theorems 

Theorem 1. Let $\Theta$ be a bounded subset of $\mathbb{R}^d$. Suppose that

(a) Assumptions (A1)-(A3) hold.

(b) $G_0$ has a finite number of support points in $\Theta$.

(c) The hierarchical Dirichlet process model is specified by Eqs. (1) and (2), where parameters $\alpha \in (0, 1]$, $\gamma > 0$ and $H \in \mathscr{P}(\Theta)$ are known.

Then, as $m \to \infty$ and $n \to \infty$ such that $n_1(m) < n < n_2(m)$ for some sequences $n_2(m)$ and $n_1(m) \to \infty$, there holds

$$\Pi_G\left(W_1(G, G_0) \geq C \left(\frac{n^{3d} \log m}{m}\right)^{1/(2d+2)} \,\Big|\, Y^{[m]}_{[n]}\right) \longrightarrow 0$$

in $P_{Y^{[m]}_{[n]}|G_0}$-probability for a large constant $C$.



Remark 1 The details of $n_1(m)$ and $n_2(m)$ depend on additional conditions of $f$. Define

$$\alpha^* := \min_{\theta \in \mathrm{spt}\, G_0} \alpha G_0(\{\theta\}).$$

(i) If $f$ is ordinary smooth with parameter $\beta$, then it suffices to set

$$n_1(m) \asymp m^{\frac{4 + (2\beta+1)d'}{3d(4 + (2\beta+1)d') + (2d+2)\alpha^*}}$$

and $n_2(m) \asymp (m/\log m)^{1/3d}$, for any $d' > d$. In particular, if $n$ is allowed to grow at the rate $n \asymp n_1(m)$ then the posterior concentration rate is

$$\epsilon_{m,n} \asymp n^{-\frac{\alpha^*}{4 + (2\beta+1)d'}} (\log n)^{1/(2d+2)} \asymp m^{-\gamma} (\log m)^{1/(2d+2)},$$

where

$$\gamma = \frac{\alpha^*}{3d(4 + (2\beta+1)d') + (2d+2)\alpha^*} < \frac{1}{2d+2}.$$

(ii) If $f$ is supersmooth with parameter $\beta$, then it suffices to set

$$\frac{m}{\log m \, (\log n)^{\alpha^*(2d+2)/\beta}} \lesssim n^{3d} \lesssim \frac{m}{\log m}.$$

In particular, if $n$ satisfies $n^{3d} (\log n)^{\alpha^*(2d+2)/\beta} \asymp m/\log m$, then we obtain the concentration rate $\epsilon_{m,n} \asymp (\log n)^{-\alpha^*/\beta} \asymp (\log m)^{-\alpha^*/\beta}$.

Remark 2 Requirements of the type $n_1(m) < n < n_2(m)$ appear crucial in deriving posterior concentration rates in hierarchical models. Briefly, $n$ has to be large so that the convergence in the distribution of $n$-data vectors $Y_{[n]}$ induces the convergence of mixing measures $Q_i$'s, which in turn induces the convergence of base measure $G$. The growth rate of $n$, relative to $m$, is also controlled, according to our proof method, in order to bound the covering number of the space of densities $p_{Y_{[n]}|G}$ and to ensure the thickness of the prior induced on these densities. It can be observed from the proof that the admissible range $[n_1(m), n_2(m)]$ may be enlarged if more is assumed about the kernel density $f$. It is an interesting open question to establish the concentration behavior (or the lack thereof) for the full range of $n$, including those that fall outside the admissible range.

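Theorem 2. Let $\Theta$ be a bounded subset of $\mathbb{R}^d$. Suppose that

(a) Assumptions (A1) and (A2) hold for some $r = r_0 \geq 1$.

(b) $G_0$ has $k < \infty$ support points in $\Theta$; $Q_0^* \in \mathscr{P}(\Theta)$ is such that $\mathrm{spt}\, Q_0^* \subset \mathrm{spt}\, G_0$.

(c) The hierarchical Dirichlet process model is specified by Eqs. (1), (2) and (8), where parameters $\alpha \in (0, 1]$, $\gamma > 0$, and $H \in \mathscr{P}(\Theta)$ are known.

(d) For each $\tilde{n}$, there is a net $\epsilon_{m,n} = \epsilon_{m,n}(\tilde{n}) \downarrow 0$ indexed by $m, n$ such that

$$\Pi_G\left(W_1(G, G_0) \geq C \epsilon_{m,n} \mid Y^{[m]}_{[n]}, Y^0_{[\tilde{n}]}\right) \longrightarrow 0$$

in $P_{Y^{[m]}_{[n]}|G_0} \times P_{Y^0_{[\tilde{n}]}|Q_0^*}$-probability, as $m \to \infty$ and $n = n(m) \to \infty$ at a suitable rate with respect to $m$. Here, $C$ is a constant independent of $\tilde{n}, m, n$.

As $m$ and $n = n(m) \to \infty$, and then $\tilde{n} \to \infty$, we have

$$\Pi_Q\left(h(Q_0 * f, Q_0^* * f) \geq \delta_{m,n,\tilde{n}} \mid Y^0_{[\tilde{n}]}, Y^{[m]}_{[n]}\right) \longrightarrow 0$$

in $P_{Y^0_{[\tilde{n}]}|Q_0^*} \times P_{Y^{[m]}_{[n]}|G_0}$-probability, where the rates $\delta_{m,n,\tilde{n}}$ are given as follows:

(i) $\delta_{m,n,\tilde{n}} \asymp (\log \tilde{n}/\tilde{n})^{1/(d+2)} + \epsilon_{m,n}^{r_0/2} \log(1/\epsilon_{m,n})$.

(ii) $\delta_{m,n,\tilde{n}} \asymp (\log \tilde{n}/\tilde{n})^{1/2}$, if $f$ is ordinary smooth with smoothness $\beta > 0$, and $m$ and $n$ grow so fast that $\epsilon_{m,n} \lesssim \tilde{n}^{-(\alpha+k+M_0)} (\log \tilde{n})^{-(\alpha+k-2)}$ for some constant $M_0 > 0$ depending only on $d, k, \beta$ and $\mathrm{diam}(\Theta)$.

(iii) $\delta_{m,n,\tilde{n}} \asymp (1/\tilde{n})^{1/(\beta+2)}$, if $f$ is supersmooth with smoothness $\beta > 0$, and $m$ and $n$ grow so fast that $\epsilon_{m,n} \lesssim \tilde{n}^{-2(\alpha+k)/(\beta+2)} (\log \tilde{n})^{-2(\alpha+k-1)} \exp(-4 \tilde{n}^{\beta/(\beta+2)})$.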



Remark 1 Condition (b) that $\mathrm{spt}\, Q_0^* \subset \mathrm{spt}\, G_0$ motivates the incorporation of mixture distribution $Q_0 * f$ to the Bayesian hierarchy as specified by Eq. (8). According to the model, $Q_0$ shares the same supporting atoms with $Q_1, \ldots, Q_m$, as they all inherit from random base measure $G$. Note also that the condition on the posterior of $G$ as stated in (d) is slightly different from the conclusion reached by Theorem 1 due to the additional conditioning on $Y^0_{[\tilde{n}]}$. This condition can be proved directly under additional conditions on $Q_0^*$ and $G_0$, by a somewhat cumbersome (but conceptually simple) modification of the proof of Theorem 1. This presentation is avoided as it is not central to the main message of the present theorem.

Remark 2 The general concentration rate $(\log \tilde{n}/\tilde{n})^{1/(d+2)} + \epsilon_{m,n}^{r_0/2} \log(1/\epsilon_{m,n})$ given in (i) should be compared to the standard rate $(\log \tilde{n}/\tilde{n})^{1/(d+2)}$ one is able to get by fitting a stand-alone mixture model $Q_0 * f$ using a separate Dirichlet process prior for $Q_0$ (see Theorem 6 of Nguyen [2012a]). The quantity $\epsilon_{m,n}^{r_0/2} \log(1/\epsilon_{m,n})$ can be viewed as the overhead cost for maintaining the latent hierarchy involving the random Dirichlet prior $\mathscr{D}_{\alpha G}$.



Remark 3 (ii) and (iii) demonstrate the benefits of hierarchical models for groups of data with relatively small sample size: when $m \gg n$ (and $n = n(m) \gg \tilde{n}$) so that $\epsilon_{m,n}$ is sufficiently small, we obtain parametric rates for the mixture density $Q_0 * f$: $(\log \tilde{n}/\tilde{n})^{1/2}$ for ordinary smooth kernels, and $(\log \tilde{n}/\tilde{n})^{1/(\beta+2)}$ for supersmooth kernels. This is a sharp improvement over the standard rate $(\log \tilde{n}/\tilde{n})^{1/(d+2)}$ one would get for fitting a stand-alone mixture model $Q_0 * f$ using a Dirichlet process prior. This improvement is possible due to the confluence of two factors. First, by attaching $Q_0$ to the Bayesian hierarchy one is able to exploit the assumption that random measure $Q_0$ shares the same supporting atoms as the random base measure $G$. This is translated to a favorable level of thickness of the conditional prior for $Q_0$ (given the $m \times n$ data $Y^{[m]}_{[n]}$), as measured by small Kullback-Leibler neighborhoods. The second factor is due to our new construction of sieves (subsets of $\mathscr{P}(\Theta)$) on which the Dirichlet process concentrates most of its mass, but which have suitably small entropy numbers.

In summary, Theorem 1 establishes posterior concentration of the Dirichlet base measure in a hierarchical setting, while Theorem 2 demonstrates dramatic gains in efficiency in the inference of groups of data with relatively small sample size. For groups with relatively large sample size, the concentration rate is weakened. This quantifies the effects of "borrowing of strength", from large groups of data to smaller groups. This is arguably a good virtue of hierarchical models: it is the populations with small sample sizes that need improved inference the most.



2.2 Method of proof and tools 

Complete proofs of both main theorems are placed in Section 6. At a high level, the proof of posterior concentration for the base measure $G$ follows a general strategy which is formulated by an abstract posterior concentration theorem for hierarchical models (Theorem 6 in Section 6). This theorem is built on the general framework of [Ghosal et al., 2000], which is suitably modified as we work directly on the posterior of quantities of interest (e.g., base measure $G$) rather than the posterior of the marginal density that generates the data, $p_{Y_{[n]}|G}$.

The bulk of the paper is therefore devoted to verifying conditions of Theorem 6. This hinges on a careful study of the relationship between Wasserstein distance between the Dirichlet base measures, $W_1(G, G')$, and Hellinger distance between the densities of $n$-vector $Y_{[n]}$, $h(p_{Y_{[n]}|G}, p_{Y_{[n]}|G'})$. The main challenge here is that $G$ and $p_{Y_{[n]}|G}$ are related through a hierarchy of latent random variables that need to be integrated out: first the random variable distributed by mixing measure $Q$, and second the random measure $Q$ distributed by the Dirichlet process.

To establish this relationship, in Section 3 we develop a general notion of transportation distance of Bayesian hierarchies of random measures. This notion plays a fundamental role in our theory, and we believe it is also of independent interest. With transportation distances it is possible to compare not only two probability measures defined on $\Theta$, but also two probability measures on the space of measures on $\Theta$, and so on. Transportation distances are natural for comparing Bayesian hierarchies, because the geometry of the space of support of measures is inherited directly into the definition of the transportation distances between the measures. Key results in Section 3 include the relationship between Wasserstein distance $W_r(G, G')$ and transportation distance between two Dirichlet processes $\mathscr{D}_{\alpha G}$ and $\mathscr{D}_{\alpha' G'}$, and an upper bound of the Hellinger distance $h(p_{Y_{[n]}|G}, p_{Y_{[n]}|G'})$ in terms of $W_r(G, G')$.

The most challenging part of the proof lies in establishing a lower bound of the variational distance $V(p_{Y_{[n]}|G}, p_{Y_{[n]}|G'})$ (subsequently $h(p_{Y_{[n]}|G}, p_{Y_{[n]}|G'})$) in terms of Wasserstein distance $W_r(G, G')$. This is achieved by Theorem 5 in Section 5. The key argument of the proof of this theorem lies in the existence of a robust test between two Dirichlet processes $\mathscr{D}_{\alpha G}$ and $\mathscr{D}_{\alpha G'}$, where the robustness here is measured by Wasserstein metric $W_r$ on $\mathscr{P}(\Theta)$. This existence result (Theorem 4 of Section 4) can be understood as a regularity condition on the concentration of measure along the boundary of a certain set which is used to approximate the total variation between two Dirichlet processes. Along with Section 3, the development in Sections 4 and 5 may also be of independent interest.

Finally, the proof of Theorem 2 hinges on a posterior concentration theorem for the mixing measure $Q$, conditionally given the event that the base measure $G$ is perturbed in Wasserstein metric $W_1$; see Theorem 7 in Section 7. A useful technique is to show that the Dirichlet process places most of its mass on sets with relatively small covering number in $W_r$. This enables construction of a sieve of subsets of $\mathscr{P}(\Theta)$ which yields favorable rates of posterior concentration.

2.3 Concluding remarks and further development 

We obtained posterior concentration rates for the base measure of a Dirichlet process given $m \times n$ observations associated with the sampled measures from the Dirichlet process, using tools developed with transportation distances. A consequence of this theory is the quantification of the "borrowing of strength" phenomenon in a hierarchical and nonparametric setting, as illustrated by the hierarchical Dirichlet process. Several limitations remain in these results. They are restricted to true Dirichlet base measures with a finite (albeit unknown) number of support points; extending the theory to base measures that have infinite support is worthwhile. The constrained interaction between two asymptotic quantities $m$ and $n$ is
worthy of further investigation. Finally, it is of interest to obtain lower bounds on concen- 
tration rates, and to develop a minimax optimal theory, for the variables residing in latent 
hierarchies. 



3 Transportation distances of Bayesian hierarchies 



Let $\Theta$ be a complete separable metric space (i.e., $\Theta$ is a Polish space) and $\mathscr{P}(\Theta)$ be the space of Borel probability measures on $\Theta$. The weak topology on $\mathscr{P}(\Theta)$ (or narrow topology) is induced by convergence against $C_b(\Theta)$, i.e., bounded continuous test functions on $\Theta$. Since $\Theta$ is Polish, $\mathscr{P}(\Theta)$ is itself a Polish space. $\mathscr{P}(\Theta)$ is metrized by the $W_r$ Wasserstein distance: for $G, G' \in \mathscr{P}(\Theta)$ and $r \geq 1$,

$$W_r(G, G') = \inf_{\kappa \in \mathcal{T}(G, G')} \left[ \int \|\theta - \theta'\|^r \, d\kappa(\theta, \theta') \right]^{1/r}.$$

By a recursion of notations, $\mathscr{P}^{(2)}(\Theta) := \mathscr{P}(\mathscr{P}(\Theta))$ is defined as the space of Borel probability measures on $\mathscr{P}(\Theta)$. This is a Polish space, and will be endowed again with a Wasserstein metric that is induced by metric $W_r$ on $\mathscr{P}(\Theta)$:

$$W_r(\mathcal{D}, \mathcal{D}') = \inf_{\mathcal{K} \in \mathcal{T}(\mathcal{D}, \mathcal{D}')} \left[ \int W_r^r(G, G') \, d\mathcal{K}(G, G') \right]^{1/r}.$$

We can safely reuse notation $W_r$ as the context is clear from the arguments. Since the cost function $\|\theta - \theta'\|$ is continuous, the existence of an optimal coupling $\kappa \in \mathcal{T}(G, G')$ which achieves the infimum is guaranteed due to the tightness of $\mathcal{T}(G, G')$ (cf. Theorem 4.1 of Villani [2008]). Moreover, $W_r(G, G')$ is a continuous function and $\mathcal{T}(\mathcal{D}, \mathcal{D}')$ is again tight, so the existence of an optimal coupling in $\mathcal{T}(\mathcal{D}, \mathcal{D}')$ is also guaranteed.

Now we present a lemma on a monotonic property of Wasserstein metrics defined along the recursive construction for every pair of centered random measures on $\Theta$. Part (b) highlights a very special property of the Dirichlet process. In what follows $P$ denotes a generic measure-valued random variable. By $\int P \, d\mathcal{D} = G$ we mean $\int P(A) \, d\mathcal{D} = G(A)$ for any measurable subset $A \subset \Theta$.

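Lemma 1. (a) Let $\mathcal{D}, \mathcal{D}' \in \mathscr{P}^{(2)}(\Theta)$ be such that $\int P \, d\mathcal{D} = G$ and $\int P \, d\mathcal{D}' = G'$. For $r \geq 1$, if $W_r(\mathcal{D}, \mathcal{D}')$ is finite then $W_r(\mathcal{D}, \mathcal{D}') \geq W_r(G, G')$.

(b) If $\mathcal{D} = \mathscr{D}_{\alpha G}$ and $\mathcal{D}' = \mathscr{D}_{\alpha G'}$, then $W_r(\mathcal{D}, \mathcal{D}') = W_r(G, G')$ if both quantities are finite.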

Proof. (a) Take an arbitrary coupling $\mathcal{K} \in \mathcal{T}(\mathcal{D}, \mathcal{D}')$ for which $\int W_r^r(Q, Q') \, d\mathcal{K}$ is bounded. Let $(Q, Q')$ be a pair of random measures whose law is $\mathcal{K}$. Given $Q, Q'$, let $\kappa_{Q,Q'}$ denote an associated optimal coupling of $(Q, Q')$ that is chosen in a measurable way (cf. Corollary 5.22 of Villani [2008]). By Fubini's theorem,

$$\int W_r^r(Q, Q') \, d\mathcal{K} = \int \int \|\theta - \theta'\|^r \, d\kappa_{Q,Q'}(\theta, \theta') \, d\mathcal{K} = \int \|\theta - \theta'\|^r \int \kappa_{Q,Q'}(d\theta, d\theta') \, d\mathcal{K}.$$

We note that the second integral in the last equation of the previous display can be written as $\kappa(d\theta, d\theta')$, for some valid coupling $\kappa \in \mathcal{T}(G, G')$. To see this, by marginalizing out $\theta'$ we have for any measurable $A \subset \Theta$,

$$\int \kappa_{Q,Q'}(A \times \Theta) \, d\mathcal{K} = \int Q(A) \, d\mathcal{K} = \int Q(A) \, d\mathcal{D} = G(A).$$

The first equality is by the definition of $\kappa_{Q,Q'}$; the second is by the definition of $\mathcal{K}$; the third is by the assumption on $\mathcal{D}$. A similar identity holds for marginalizing out $\theta$. Thus, $\int W_r^r(Q, Q') \, d\mathcal{K} \geq \inf_{\kappa \in \mathcal{T}(G, G')} \int \|\theta - \theta'\|^r \, d\kappa = W_r^r(G, G')$. This inequality holds for any coupling $\mathcal{K} \in \mathcal{T}(\mathcal{D}, \mathcal{D}')$, so $W_r^r(\mathcal{D}, \mathcal{D}') \geq W_r^r(G, G')$.

(b) Let $\kappa$ be an optimal coupling of $(G, G')$, i.e., $W_r^r(G, G') = \int \|\theta - \theta'\|^r \, d\kappa(\theta, \theta')$, and $\gamma$ be a random probability measure on $\Theta \times \Theta$ such that $\mathrm{law}(\gamma) = \mathscr{D}_{\alpha \kappa}$. Let $Q, Q'$ be the random marginal measures induced by the joint measure $\gamma(d\theta, d\theta')$ with respect to $\theta$ and $\theta'$, respectively. By the definition of Dirichlet processes, along with the fact that $\kappa \in \mathcal{T}(G, G')$, we have $\mathrm{law}(Q) = \mathscr{D}_{\alpha G}$ and $\mathrm{law}(Q') = \mathscr{D}_{\alpha G'}$. Thus, the joint distribution of $(Q, Q')$, denoted by $\mathcal{K}$, satisfies $\mathcal{K} \in \mathcal{T}(\mathcal{D}, \mathcal{D}')$. Since $\gamma$ is a coupling of $(Q, Q')$ by our construction, we have $W_r^r(Q, Q') \leq \int \|\theta - \theta'\|^r \, d\gamma(\theta, \theta')$ almost surely. The expectation under the coupling $\mathcal{K}$ satisfies:

$$\mathbb{E} W_r^r(Q, Q') \leq \mathbb{E} \int \|\theta - \theta'\|^r \, d\gamma(\theta, \theta') = \int \|\theta - \theta'\|^r \, d\kappa(\theta, \theta') = W_r^r(G, G').$$

The first equality is due to a standard property of Dirichlet processes that $\mathbb{E}\gamma = \kappa$ and Fubini's theorem. We deduce that $W_r^r(\mathcal{D}, \mathcal{D}') \leq W_r^r(G, G')$. Combining with part (a) gives the desired identity. □
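The coupling construction in part (b) can be checked numerically in a toy case. The sketch below is an illustration under stated assumptions: $G$ and $G'$ are two-atom measures on the real line, the coupling between them is the monotone one (optimal in one dimension), and a Dirichlet process whose base measure is finitely supported is sampled as a Dirichlet weight vector. The Monte Carlo average of the transport cost under the random coupling should then be close to the cost under the fixed coupling, reflecting that the mean of the random coupling is the fixed one.

```python
import random


def sample_dirichlet(shapes, rng):
    """Draw Dirichlet(shapes) weights by normalizing independent Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in shapes]
    total = sum(draws)
    return [g / total for g in draws]


def expected_transport_cost(pairs, kappa_weights, alpha, r=1.0, draws=20000, seed=1):
    """Monte Carlo estimate of E int |theta - theta'|^r d(gamma), gamma ~ DP(alpha * kappa).

    kappa is a finitely supported coupling: kappa_weights[k] is the mass placed
    on the pair pairs[k] = (theta_k, theta'_k).
    """
    rng = random.Random(seed)
    shapes = [alpha * w for w in kappa_weights]
    costs = [abs(a - b) ** r for a, b in pairs]
    total = 0.0
    for _ in range(draws):
        gamma = sample_dirichlet(shapes, rng)  # random weights on the same pairs
        total += sum(g * c for g, c in zip(gamma, costs))
    return total / draws


# G = (delta_0 + delta_1) / 2 and G' = (delta_0.5 + delta_2) / 2, coupled monotonically.
pairs, kappa_weights = [(0.0, 0.5), (1.0, 2.0)], [0.5, 0.5]
exact_cost = sum(w * abs(a - b) for (a, b), w in zip(pairs, kappa_weights))
estimate = expected_transport_cost(pairs, kappa_weights, alpha=1.0)
```

Here `exact_cost` equals $W_1(G, G')$ for this pair, and `estimate` approximates the upper bound on $\mathbb{E} W_1(Q, Q')$ used in the proof.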

3.1 Divergences in hierarchical Dirichlet processes 

Recall the hierarchical Dirichlet process defined by Eqs. (1) and (2). The marginal density $p_{Y_{[n]}|G}$ is obtained by integrating out random measures $Q$, see Eq. (3). Now, we provide upper bounds on Kullback-Leibler distance $K(p_{Y_{[n]}|G}, p_{Y_{[n]}|G'})$ and other related distances in terms of transportation distance between $G$ and $G'$.

Lemma 2. (a) Given assumption (A1), $K(p_{Y_{[n]}|G}, p_{Y_{[n]}|G'}) \leq C_1 n W_r^r(G, G')$ and $h^2(p_{Y_{[n]}|G}, p_{Y_{[n]}|G'}) \leq C_1 n W_r^r(G, G')$.

(b) Given assumption (A2), we have $\chi(p_{Y_{[n]}|G}, p_{Y_{[n]}|G'}) \leq M^n$.

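Proof. To simplify notations, let $\mathcal{D} = \mathscr{D}_{\alpha G}$ and $\mathcal{D}' = \mathscr{D}_{\alpha G'}$. The density $p_{Y_{[n]}|G}$, defined by Eq. (3), is succinctly written as $p_{Y_{[n]}|G} := \int (Q * f)^n \, \mathcal{D}(dQ)$, and likewise, $p_{Y_{[n]}|G'} := \int (Q' * f)^n \, \mathcal{D}'(dQ')$. Due to the convexity of Kullback-Leibler divergence, by Jensen's inequality we obtain that for any coupling $\mathcal{K} \in \mathcal{T}(\mathcal{D}, \mathcal{D}')$,

$$K(p_{Y_{[n]}|G}, p_{Y_{[n]}|G'}) = K\left(\int (Q * f)^n \, \mathcal{K}(dQ, dQ'), \int (Q' * f)^n \, \mathcal{K}(dQ, dQ')\right)$$
$$\leq \int K((Q * f)^n, (Q' * f)^n) \, \mathcal{K}(dQ, dQ')$$
$$= n \int K(Q * f, Q' * f) \, \mathcal{K}(dQ, dQ').$$

Using the same argument, now for any coupling $\kappa \in \mathcal{T}(Q, Q')$,

$$K(Q * f, Q' * f) = K\left(\int f(\cdot|\theta) \, \kappa(d\theta, d\theta'), \int f(\cdot|\theta') \, \kappa(d\theta, d\theta')\right)$$
$$\leq \int K(f(\cdot|\theta), f(\cdot|\theta')) \, \kappa(d\theta, d\theta')$$
$$\leq \int C_1 \|\theta - \theta'\|^r \, \kappa(d\theta, d\theta').$$

Since this holds for any $\kappa \in \mathcal{T}(Q, Q')$, we obtain that $K(Q * f, Q' * f) \leq C_1 W_r^r(Q, Q')$. Plugging back in the upper bound for $K(p_{Y_{[n]}|G}, p_{Y_{[n]}|G'})$, we have:

$$K(p_{Y_{[n]}|G}, p_{Y_{[n]}|G'}) \leq C_1 n \inf_{\mathcal{K}} \int W_r^r(Q, Q') \, \mathcal{K}(dQ, dQ') = C_1 n W_r^r(\mathcal{D}, \mathcal{D}').$$

Up to this point we have not used the fact that $\mathcal{D}$ and $\mathcal{D}'$ are Dirichlet processes. By Lemma 1(b) we arrive at the desired inequality. The proofs of the other two inequalities are similar. □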
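The convexity step in this proof can be checked numerically for a specific kernel. The sketch below is an illustration, not part of the argument: it assumes a standard Gaussian location kernel on the real line, for which the Kullback-Leibler divergence between kernels at $\theta$ and $\theta'$ is $(\theta - \theta')^2/2$, so the kernel bound holds with $r = 2$ and $C_1 = 1/2$. The mixing measures, the coupling and the integration grid are hypothetical choices.

```python
import math


def gaussian_mixture_density(x, atoms, weights):
    """Density of Q * f at x, for finitely supported Q and standard Gaussian f."""
    return sum(w * math.exp(-0.5 * (x - a) ** 2) / math.sqrt(2.0 * math.pi)
               for a, w in zip(atoms, weights))


def kl_between_mixtures(atoms_q, weights_q, atoms_q2, weights_q2,
                        lo=-12.0, hi=12.0, steps=24000):
    """Midpoint Riemann-sum approximation of K(Q * f, Q' * f)."""
    dx = (hi - lo) / steps
    total = 0.0
    for s in range(steps):
        x = lo + (s + 0.5) * dx
        p = gaussian_mixture_density(x, atoms_q, weights_q)
        q = gaussian_mixture_density(x, atoms_q2, weights_q2)
        total += p * math.log(p / q) * dx
    return total


# Q has atoms -1 and 1; Q' has atoms -0.5 and 1.5; equal weights.
# Coupling kappa pairs -1 with -0.5 and 1 with 1.5, so int |theta - theta'|^2 d(kappa) = 0.25.
kl_value = kl_between_mixtures([-1.0, 1.0], [0.5, 0.5], [-0.5, 1.5], [0.5, 0.5])
coupling_bound = 0.5 * 0.25  # C_1 * int |theta - theta'|^r d(kappa) with C_1 = 1/2, r = 2
```

By the Jensen argument above, `kl_value` cannot exceed `coupling_bound`, whichever coupling of the two mixing measures is used.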



3.2 Small ball probabilities and thickness of Dirichlet priors 

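By small balls we mean neighborhoods of measures $G \in \mathscr{P}(\Theta)$ centered at some $G_0$ and defined in terms of Kullback-Leibler distances induced by the hierarchical model. The Kullback-Leibler neighborhood of a given $G_0 \in \mathscr{P}(\Theta)$ with respect to $n$-vector $Y_{[n]}$ is defined as

$$B_K(G_0, \delta) = \{G \in \mathscr{P}(\Theta) \mid K(p_{Y_{[n]}|G_0}, p_{Y_{[n]}|G}) \leq \delta^2, \; K_2(p_{Y_{[n]}|G_0}, p_{Y_{[n]}|G}) \leq \delta^2\}. \tag{10}$$

The following result gives a probability bound on small balls as defined by Wasserstein metric (Lemma 5 of Nguyen [2012a]).

Lemma 3. Suppose that $\mathrm{law}(G) = \mathscr{D}_{\gamma H}$, where $H$ is a non-atomic probability measure on $\Theta$. For a small $\epsilon > 0$, let $D = D(\epsilon, \Theta, \|\cdot\|)$ be the packing number of $\Theta$ under $\|\cdot\|$. Then, for any $G_0 \in \mathscr{P}(\Theta)$,

$$\mathbb{P}\left(G : W_r^r(G_0, G) \leq (2^r + 1)\epsilon^r\right) \geq \frac{\Gamma(\gamma) \gamma^D}{(2D)^{D-1}} \left(\frac{\epsilon}{\mathrm{diam}(\Theta)}\right)^{r(D-1)} \sup_S \prod_{i=1}^{D} H(S_i).$$

Here, $(S_1, \ldots, S_D)$ denotes the $D$ disjoint $\epsilon/2$-balls that form a maximal packing of $\Theta$. $\Gamma(\cdot)$ denotes the gamma function. The supremum is taken over all packings $S := (S_1, \ldots, S_D)$.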

Combining the previous lemmas yields an estimate of the thickness of the hierarchical Dirichlet prior:

Theorem 3. Assume (A1)-(A3), and let $\Theta$ be a bounded subset of $\mathbb{R}^d$. Then, for $D := (\mathrm{diam}(\Theta))^d (n^3/\delta^2)^{d/r}$ and constants $c, C$ depending only on $C_1, M, c_0, \gamma, \mathrm{diam}(\Theta)$ and $r$, for any $G_0 \in \mathscr{P}(\Theta)$, $\delta > 0$ and $n > C \log(1/\delta)$,

$$\log \mathbb{P}(G \in B_K(G_0, \delta)) \geq c \log\left[\gamma^D (\delta^2/n^3)^{(1 + d/r)(D-1) + Dd/r}\right].$$

Remark We note the appearance of term $(1/n)^D$ in the lower bound. Accordingly, in order to control the thickness of the prior we require $n$ not to grow too fast.



Proof. We shall invoke a bound of Wong and Shen [1995] (Theorem 5) on the KL divergence. This bound says that if $p$ and $q$ are two densities on a common space such that $\int p^2/q \leq M$, then for some universal constant $\epsilon_0 > 0$, as long as $h(p, q) \leq \epsilon \leq \epsilon_0$, there holds: $K(p, q) \leq C_0 \epsilon^2 \log(M/\epsilon)$, and $K_2(p, q) := \int p (\log(p/q))^2 \leq C_0 \epsilon^2 [\log(M/\epsilon)]^2$, where $C_0$ is a universal constant.

For a pair of $G_0, G \in \mathscr{P}(\Theta)$, if $W_r(G_0, G) \leq \epsilon$ then by Lemma 2,

$$h^2(p_{Y_{[n]}|G_0}, p_{Y_{[n]}|G}) \leq K(p_{Y_{[n]}|G_0}, p_{Y_{[n]}|G})/2 \leq C_1 n \epsilon^r/2.$$

We also have $\chi(p_{Y_{[n]}|G_0}, p_{Y_{[n]}|G}) \leq M^n$. We can apply the upper bound described in the previous paragraph to obtain:

$$K_2(p_{Y_{[n]}|G_0}, p_{Y_{[n]}|G}) \leq \frac{C_0 C_1 n \epsilon^r}{2} \left[\log \frac{M^n}{(C_1 n \epsilon^r/2)^{1/2}}\right]^2.$$

If we set $\epsilon^r = \delta^2/n^3$, then the quantity in the right hand side of the previous display is bounded by $C_2 \delta^2$, as long as $n > \log(1/\delta) \vee (C_1 \delta^2/2\epsilon_0^2)^{1/2}$, where constant $C_2$ depends only on $C_0, C_1, M$. Thus,

$$\mathbb{P}(G \in B_K(G_0, \delta)) \geq \mathbb{P}(W_r^r(G_0, G) \leq C_3 \delta^2/n^3),$$

where constant $C_3$ depends only on $C_0, C_1, M$. Combining with Lemma 3, and using that by (A3) $H(S_i) \geq c_0 \epsilon^d = c_0 (\delta^2/n^3)^{d/r}$ and that $D = (\mathrm{diam}(\Theta)/\epsilon)^d$, we obtain the desired lower bound. □

4 Behavior in the boundary of two Dirichlet processes 

In this section we study the property of the boundary of certain sets (of measures) which 
can be used to test one Dirichlet process against another. In binary hypothesis testing, such 
a test set can be generally defined via the variational distance between the distributions that 
define the two hypotheses. However, for the purpose of subsequent development we need 
a more robust test in which the robustness can be expressed in terms of the measure of the 
test set's perturbation along its boundary. 

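Let $\mathcal{D}, \mathcal{D}' \in \mathcal{P}^{(2)}(\Theta)$. The variational distance between $\mathcal{D}$ and $\mathcal{D}'$ is defined by:

\[ V(\mathcal{D},\mathcal{D}') = \sup_{A \subset \mathcal{P}(\Theta)} |\mathcal{D}(A) - \mathcal{D}'(A)|. \]

Here the supremum is taken over all Borel measurable sets $A \subset \mathcal{P}(\Theta)$. In what follows, fix $r \ge 1$. For a subset $A \subset \mathcal{P}(\Theta)$, the boundary set $\mathrm{bd}\,A$ is defined as the set of all elements $P \in \mathcal{P}(\Theta)$ such that every $W_r$ neighborhood of $P$ has non-empty intersection with $A$ as well as with the complement set $A^c$. Also define $A_\epsilon$ to be the set of all $P \in \mathcal{P}(\Theta)$ for which there is a $Q \in A$ with $W_r(Q,P) \le \epsilon$.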

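Definition 1. Let $\mathcal{D} = \mathcal{D}_{\alpha G}$ for some fixed $G \in \mathcal{P}(\Theta)$ and $\alpha > 0$. Let $\mathcal{G}$ be a subset of $\mathcal{P}(\Theta)$ and $\alpha' > 0$. A class $\mathcal{C} \subset \mathcal{P}^{(2)}(\Theta)$ of Dirichlet processes $\mathcal{C} := \{\mathcal{D}_{\alpha' G'} \mid G' \in \mathcal{G}\}$ is said to have $\alpha^*$-regular boundary with respect to $\mathcal{D}$ for some constant $\alpha^* > 0$, if there are positive constants $C_0$, $c_0$ and $c_1$ dependent only on $\mathcal{D}$ such that for each $\mathcal{D}' = \mathcal{D}_{\alpha' G'} \in \mathcal{C}$ there exists a measurable subset $B \subset \mathcal{P}(\Theta)$ for which the following hold:

(i) $\mathcal{D}'(B) - \mathcal{D}(B) \ge c_0 W_r^r(G,G')$,

(ii) $\mathcal{D}(B_\epsilon \setminus B) \le C_0\epsilon^{\alpha^*}$ for any $\epsilon \le c_1 W_r(G,G')$.

Remark. The key requirement here is that the constants $C_0$, $c_0$ and $c_1$ are independent of $\mathcal{D}' \in \mathcal{C}$. Consider the following example: $\mathcal{G} := \{G' \in \mathcal{P}(\Theta) \mid \mathrm{spt}\,G' \cap \mathrm{spt}\,G = \emptyset\}$. Take $\mathcal{D}' := \mathcal{D}_{\alpha' G'}$ for some $G' \in \mathcal{G}$. By a standard fact of Dirichlet processes (e.g., see Thm. 3.2.4 of Ghosh and Ramamoorthi [2002]), $\mathrm{spt}\,\mathcal{D} = \{P : \mathrm{spt}\,P \subset \mathrm{spt}\,G\}$ and $\mathrm{spt}\,\mathcal{D}' = \{P : \mathrm{spt}\,P \subset \mathrm{spt}\,G'\}$. Thus, we also have $\mathrm{spt}\,\mathcal{D} \cap \mathrm{spt}\,\mathcal{D}' = \emptyset$. It follows that $V(\mathcal{D},\mathcal{D}') = 1$. If we choose $\epsilon_1 = \inf_{\theta \in \mathrm{spt}\,G,\ \theta' \in \mathrm{spt}\,G'} \|\theta - \theta'\| > 0$, and let $B = (\mathrm{spt}\,\mathcal{D}')_{\epsilon_1/2}$, then $\mathcal{D}'(B) = 1$ and $\mathcal{D}(B) = 0$. Moreover, for any $\epsilon < \epsilon_1/4$, $\mathcal{D}(B_\epsilon) = 0$, so $\mathcal{D}(B_\epsilon \setminus B) = 0$. This construction appears to suggest that $\mathcal{C} := \{\mathcal{D}_{\alpha' G'} \mid G' \in \mathcal{G}\}$ has $\alpha$-regular boundary with $\mathcal{D}$ for any $\alpha > 0$. This is not the case, because it is not possible to have $\epsilon_1 \ge c_1 W_r(G,G')$ for some $c_1$ independent of $G'$ (that is, $\epsilon_1$ can be arbitrarily close to 0 even as $W_r(G,G')$ remains bounded away from 0).

The main result of this section is the following.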

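Theorem 4. Suppose that $\Theta$ is bounded. Let $\mathcal{D} = \mathcal{D}_{\alpha G}$, where $G = \sum_{i=1}^k \beta_i\delta_{\theta_i}$ for some $k < \infty$ and $\alpha, \alpha_0 \in (0,1]$. Define

\[ \mathcal{C} = \{\mathcal{D}_{\alpha' G'} \mid G' \in \mathcal{P}(\Theta);\ \alpha' \in [\alpha_0, 1];\ W_r(G',G) \le \epsilon_0\}. \]

Then, for some positive constant $\epsilon_0$ that depends only on $\mathcal{D}$, $\mathcal{C}$ has $\alpha^* r$-regular boundary with respect to $\mathcal{D}$, where $\alpha^* = \min_i \alpha\beta_i$.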

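Proof. Take any $G' \in \mathcal{P}(\Theta)$ such that $\epsilon := W_r(G,G') \le m := \min_{1\le i<j\le k}\|\theta_i - \theta_j\|/4$. Choose constants $c_1, c_2$ such that $c_1^r = c_2\,\mathrm{diam}(\Theta)^r = 1/2^{r+1}$. Let $S$ be the union of the $k$ closed Euclidean balls of radius $c_1\epsilon$ centered at $\theta_1, \ldots, \theta_k$, i.e., $S = \bigcup_{i=1}^k B(\theta_i, c_1\epsilon)$. Any $G' \in \mathcal{P}(\Theta)$ admits either (A) $G'(S^c) \ge c_2\epsilon^r$, or (B) $G'(S^c) < c_2\epsilon^r$.

Case (A) $G'(S^c) \ge c_2\epsilon^r$. Let $B = \{Q \in \mathcal{P}(\Theta) \mid Q(S^c) > 1/2\}$. Clearly, $\mathcal{D}(B) = 0$. Moreover, for any $Q \in B$ and $Q' \in \mathrm{spt}\,\mathcal{D}$, $W_r^r(Q,Q') \ge (1/2)(c_1\epsilon)^r$. So for any $\epsilon' < (1/2)^{1/r}c_1\epsilon$, $\mathcal{D}(B_{\epsilon'}) = 0$. Condition (ii) of Definition 1 is satisfied.

It remains to verify condition (i). If $G'(S) = 0$ then $G'(S^c) = 1$ and $\mathcal{D}'(B) = 1$. So, $\mathcal{D}'(B) - \mathcal{D}(B) = 1$. On the other hand, if $G'(S) > 0$ and $\mathrm{law}(Q) = \mathcal{D}'$, then $\mathrm{law}(Q(S)) = \mathrm{Beta}(\alpha' G'(S), \alpha' G'(S^c))$. So,

\[
\begin{aligned}
\mathcal{D}'(B) = P(Q(S) < 1/2) &= \frac{\Gamma(\alpha')}{\Gamma(\alpha' G'(S))\Gamma(\alpha' G'(S^c))} \int_0^{1/2} x^{\alpha' G'(S)-1}(1-x)^{\alpha' G'(S^c)-1}\,dx \\
&\ge \frac{\Gamma(\alpha')}{\Gamma(\alpha' G'(S))\Gamma(\alpha' G'(S^c))} \int_0^{1/2} x^{\alpha' G'(S)-1}\,dx \\
&= \frac{\Gamma(\alpha')}{\Gamma(\alpha' G'(S))\Gamma(\alpha' G'(S^c))} \times \frac{(1/2)^{\alpha' G'(S)}}{\alpha' G'(S)} \\
&\ge \frac{1}{2}\Gamma(\alpha')\alpha' G'(S^c) \ge \frac{1}{2}\Gamma(\alpha'+1)c_2\epsilon^r.
\end{aligned}
\]

In the above display, the first inequality is due to $(1-x)^\gamma \ge 1$ for $\gamma \le 0$, the second inequality is due to $x\Gamma(x) = \Gamma(x+1) \le 1$ for $0 < x \le 1$. Condition (i) is verified.

Case (B) $G'(S^c) < c_2\epsilon^r$. Let $\beta_i' = G'(B_i)$ for $i = 1, \ldots, k$, where $B_i$ denotes the ball $B(\theta_i, c_1\epsilon)$. In addition, let $\beta_0' = G'(S^c) < c_2\epsilon^r$. Consider the map $\Phi : \mathcal{P}(\Theta) \to \Delta^{k-1}$, defined by

\[ \Phi(Q) := (Q(B_1)/Q(S), \ldots, Q(B_k)/Q(S)). \]

Define $P_1 := \mathrm{Dir}(\alpha\beta_1, \ldots, \alpha\beta_k)$ and $P_2 := \mathrm{Dir}(\alpha'\beta_1', \ldots, \alpha'\beta_k')$. By a standard property of Dirichlet processes, $P_1$ and $P_2$ are push-forward measures of $\mathcal{D}$ and $\mathcal{D}'$, respectively, by $\Phi$. (That is, if $\mathrm{law}(Q) = \mathcal{D}$, then $\mathrm{law}(\Phi(Q)) = P_1$. If $\mathrm{law}(Q) = \mathcal{D}'$ then $\mathrm{law}(\Phi(Q)) = P_2$.) Define $B_1$ (here a subset of the simplex, not the ball around $\theta_1$) to be the set on which the density of $P_2$ strictly exceeds that of $P_1$:

\[ B_1 := \left\{ q \in \Delta^{k-1} \,\Big|\, \frac{dP_2}{dP_1}(q) > 1 \right\}. \]

(This is the same set defined by Eq. (29) in the proof of Lemma 4.) Now let $B = \Phi^{-1}(B_1)$. Then, we have $\mathcal{D}'(B) - \mathcal{D}(B) = P_2(B_1) - P_1(B_1) = V(P_1,P_2)$.

For any $Q, Q' \in \mathrm{spt}\,\mathcal{D}$ such that $W_r(Q,Q') \le \epsilon$, the corresponding vectors $q = \Phi(Q)$ and $q' = \Phi(Q')$ must satisfy $\|q - q'\|_\infty \le \epsilon^r/m^r$. So, $\mathcal{D}(B_\epsilon \setminus B) \le P_1(q : \|q - q'\|_\infty \le \epsilon^r/m^r \text{ for some } q' \in B_1) \le C_0\epsilon^{\alpha^* r}$, for any $\epsilon < \epsilon_0$, where $C_0$ and $\epsilon_0$ are positive constants dependent only on $\mathcal{D}$. The second inequality is essentially the proof of Lemma 4(b).

It remains to verify condition (i) in Definition 1. Write $\tilde{G}' := \sum_{i=1}^k \frac{\beta_i'}{1-\beta_0'}\delta_{\theta_i}$ for brevity. We have

\[
\begin{aligned}
V(P_1,P_2) &= V\big(\mathcal{D}_{\alpha G}, \mathcal{D}_{\alpha'(1-\beta_0')\tilde{G}'}\big) \\
&\ge \frac{1}{(2\,\mathrm{diam}(\Theta))^r} W_r^r\big(\mathcal{D}_{\alpha G}, \mathcal{D}_{\alpha'(1-\beta_0')\tilde{G}'}\big) \ge \frac{1}{(2\,\mathrm{diam}(\Theta))^r} W_r^r(G, \tilde{G}').
\end{aligned}
\]

The first inequality in the above display is due to Theorem 6.15 of Villani [2008] (cf. Eq. (30)), while the second inequality is due to Lemma 1(a). Now, transporting the mass of $G'$ in each ball $B_i$ to $\theta_i$ and the mass $\beta_0'$ outside $S$ to the atoms, we have

\[
\begin{aligned}
W_r^r(G', \tilde{G}') &\le (c_1\epsilon)^r \sum_{i=1}^k \beta_i' + \mathrm{diam}(\Theta)^r\beta_0' \\
&\le (c_1\epsilon)^r + \mathrm{diam}(\Theta)^r\beta_0' \\
&\le \epsilon^r(c_1^r + c_2\,\mathrm{diam}(\Theta)^r) = \epsilon^r/2^r.
\end{aligned}
\]

The last inequality in the above display is due to the hypothesis that $\beta_0' < c_2\epsilon^r$. By the triangle inequality,

\[ W_r(G, \tilde{G}') \ge W_r(G,G') - W_r(G', \tilde{G}') \ge \epsilon - \epsilon/2 = \epsilon/2. \]

We obtain that $V(P_1,P_2) \ge (2\,\mathrm{diam}(\Theta))^{-r}(\epsilon/2)^r$. This concludes the proof. □

The following lemma supplies a key argument in the proof of the previous theorem. Its proof is placed in Section 8.

Lemma 4. Let $\mathcal{D} = \mathcal{D}_{\alpha G}$, where $G = \sum_{i=1}^k \beta_i\delta_{\theta_i}$ for some $k < \infty$, and $\alpha, \alpha' > 0$. Define $\mathcal{C} = \{\mathcal{D}_{\alpha' G'} \mid G' \in \mathcal{P}(\Theta),\ \mathrm{spt}\,G' = \mathrm{spt}\,G\}$.

(a) If $\min_i \alpha\beta_i \ge 1$, then $\mathcal{C}$ has $r$-regular boundary with respect to $\mathcal{D}$.

(b) If $\max_i \alpha\beta_i \le 1$, then $\mathcal{C}$ has $\alpha^* r$-regular boundary with respect to $\mathcal{D}$, where $\alpha^* = \min_i \alpha\beta_i$.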
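
Returning to Case (A) in the proof of Theorem 4, the Beta tail bound $P(Q(S) < 1/2) \ge \frac{1}{2}\Gamma(\alpha')\alpha' G'(S^c)$ can be sanity-checked numerically. The following is a minimal sketch, not part of the original argument: the function names, the midpoint quadrature and the grid of test values are illustrative assumptions.

```python
import math

def beta_cdf_half(a, b, steps=4000):
    """Approximate P(X < 1/2) for X ~ Beta(a, b) with 0 < a, b <= 1.

    The substitution u = x**a removes the x**(a-1) singularity at 0:
    int_0^{1/2} x^(a-1) (1-x)^(b-1) dx = (1/a) int_0^{(1/2)^a} (1 - u^(1/a))^(b-1) du.
    """
    upper = 0.5 ** a
    h = upper / steps
    total = 0.0
    for i in range(steps):
        u = (i + 0.5) * h                      # midpoint rule on [0, (1/2)^a]
        total += (1.0 - u ** (1.0 / a)) ** (b - 1.0)
    integral = total * h / a
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * integral

def case_a_lower_bound(alpha_prime, mass_outside):
    """(1/2) * Gamma(alpha') * alpha' * G'(S^c), the bound in the display."""
    return 0.5 * math.gamma(alpha_prime) * alpha_prime * mass_outside

# Q(S) ~ Beta(alpha' G'(S), alpha' G'(S^c)) with alpha' in (0, 1].
for alpha_prime in (0.3, 0.7, 1.0):
    for mass_outside in (0.01, 0.2, 0.5, 0.9):
        a = alpha_prime * (1.0 - mass_outside)
        b = alpha_prime * mass_outside
        assert beta_cdf_half(a, b) >= case_a_lower_bound(alpha_prime, mass_outside)
```

The check only covers $\alpha' \le 1$, which is the regime of Theorem 4; outside that regime the step $(1-x)^{\gamma} \ge 1$ no longer applies.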

5 Contraction inequalities in transportation distances 

The main purpose of this section is to obtain an upper bound on the distance of Dirichlet base measures $W_r(G,G')$ in terms of the variational distance of the marginal densities of observed data $V(p_{Y_{[n]}|G}, p_{Y_{[n]}|G'})$. This bound is given in Theorem 5. The difficulty of this result lies in the fact that to calculate $p_{Y_{[n]}|G}$ from $G$, one has to integrate out the random mixing measure $Q$ distributed according to the Dirichlet process $\mathcal{D}_{\alpha G}$. The proof of Theorem 5 exploits the convergence of mixing measures that arise from a point estimate, a fact that is prepared by the following lemma.

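Let $\mathcal{Q}$ be a subset of $\mathcal{P}(\Theta)$ and $\mathcal{F} = \{Q * f \mid Q \in \mathcal{Q}\}$. Let $\mathcal{Q}_k \subset \mathcal{P}(\Theta)$ be the subset of measures with at most $k$ support points, and $\mathcal{F}_k = \{Q * f \mid Q \in \mathcal{Q}_k\}$. Consider an iid $n$-vector $Y_{[n]} = (Y_1, \ldots, Y_n)$ distributed according to the convolution mixture density $Q_0 * f$ for some $Q_0 \in \mathcal{Q}$.

Let $\eta_n$ be a sequence of positive numbers converging to zero. Following Wong and Shen [1995] we consider an $\eta_n$-MLE (maximum likelihood estimator) $\hat{f}_n \in \mathcal{F}$ such that

\[ \frac{1}{n}\sum_{i=1}^n \log \hat{f}_n(Y_i) \ge \sup_{g \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \log g(Y_i) - \eta_n. \]

By our construction, there exists $\hat{Q}_n \in \mathcal{Q}$ such that $\hat{f}_n = \hat{Q}_n * f$.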

Lemma 5. Suppose that Assumption A1 holds for some $r \ge 1$, $C_1 > 0$. Let $\eta_n$ satisfy $\eta_n \le c_1\epsilon_n^2$, with $\epsilon_n \to 0$ at a rate to be specified. Then the $\eta_n$-MLE satisfies the following bounds under $Q_0 * f$-measure, for any $Q_0 \in \mathcal{Q}$:

\[ P(h(\hat{f}_n, Q_0 * f) \ge \epsilon_n) \le 5\exp(-c_2 n\epsilon_n^2), \]

\[ P(W_2(\hat{Q}_n, Q_0) \ge \delta_n) \le 5\exp(-c_2 n\epsilon_n^2), \]

where $c_1, c_2$ are some universal positive constants. $\epsilon_n$ and $\delta_n$ are given as follows:

(a) $\epsilon_n = C_2(\log n/n)^{r/2d}$, if $d > 2r$; $\epsilon_n = C_2(\log n/n)^{r/(d+2r)}$ if $d < 2r$, and $\epsilon_n = (\log n)^{3/4}/n^{1/4}$ if $d = 2r$.

(b) $\epsilon_n = C_2 n^{-1/2}\log n$, if $\mathcal{Q} = \mathcal{Q}_k$ and $\mathcal{F} = \mathcal{F}_k$ for some $k < \infty$.

(c) If $f$ is ordinary smooth with parameter $\beta > 0$, then $\delta_n = C_3\epsilon_n^{2/(4+(2\beta+1)d')}$ for any $d' > d$.

(d) If $f$ is supersmooth with parameter $\beta > 0$, then $\delta_n = C_3[-\log\epsilon_n]^{-1/\beta}$.

Here, $C_2, C_3$ are different constants in each case. $C_2$ depends only on $d$, $r$, and $C_1$, while $C_3$ depends only on $d$, $\beta$, and $C_2$.
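
The rates in Lemma 5 can be tabulated with a short helper. This is a sketch under stated assumptions: the constants $C_2$ and $C_3$ are unknown and default to 1 here (`c2`, `c3`), and the function names are hypothetical.

```python
import math

def epsilon_n(n, d, r, finite_support=False, c2=1.0):
    """Hellinger rate epsilon_n of Lemma 5, parts (a) and (b); c2 is a placeholder for C_2."""
    log_n = math.log(n)
    if finite_support:                       # part (b): at most k support points
        return c2 * log_n / math.sqrt(n)
    if d > 2 * r:                            # part (a), d > 2r
        return c2 * (log_n / n) ** (r / (2.0 * d))
    if d < 2 * r:                            # part (a), d < 2r
        return c2 * (log_n / n) ** (r / (d + 2.0 * r))
    return log_n ** 0.75 / n ** 0.25         # part (a), d = 2r

def delta_n(eps, beta, d_prime=None, supersmooth=False, c3=1.0):
    """Wasserstein rate delta_n of Lemma 5, parts (c) and (d); c3 is a placeholder for C_3.

    For the supersmooth case eps must lie in (0, 1) so that -log(eps) > 0.
    """
    if supersmooth:                          # part (d)
        return c3 * (-math.log(eps)) ** (-1.0 / beta)
    return c3 * eps ** (2.0 / (4.0 + (2.0 * beta + 1.0) * d_prime))  # part (c)

eps = epsilon_n(10**6, d=1, r=2, finite_support=True)
slow = delta_n(eps, beta=2.0, supersmooth=True)
fast = delta_n(eps, beta=2.0, d_prime=1.5)
```

Comparing `slow` and `fast` for the same `eps` illustrates the contrast between the logarithmic rate for supersmooth kernels and the polynomial rate for ordinary smooth kernels.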

The proof of this lemma is in Section 8. Here is the key theorem of this section.

Theorem 5. Suppose that $\Theta$ is a bounded subset of $\mathbb{R}^d$, and (A1) holds for some $C_1 > 0$ and $r = r_0 \ge 1$. Let $G$ be a fixed probability measure with $k < \infty$ support points and $\alpha \in (0,1]$. Define $\alpha^* = \min_{\theta \in \mathrm{spt}\,G} \alpha G(\{\theta\})$. Let $\delta_n$ and $\epsilon_n$ be vanishing sequences specified by Lemma 5. Then, for any $r \in [1,2]$, there is a universal positive constant $c_2$, and positive constants $c_0, c_0', C_0$ depending only on $G$, such that for any $G' \in \mathcal{P}(\Theta)$ and $n$ large so that $2\delta_n \le c_0' W_r(G,G')$, there holds

\[ c_0 W_r^r(G,G') \le V(P_{Y_{[n]}|G}, P_{Y_{[n]}|G'}) + C_0(2\delta_n)^{\alpha^*} + 10\exp(-c_2 n\epsilon_n^2). \]

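Proof. By Theorem 4 (applied for $W_r$) there are positive constants $C_0, c_0, c_0'$ independent of $G'$ such that there exists a measurable set $B \subset \mathcal{P}(\Theta)$ that satisfies (i) $\mathcal{D}'(B) - \mathcal{D}(B) \ge c_0 W_r^r(G,G')$ and (ii) $\mathcal{D}(B_\epsilon \setminus B) \le C_0\epsilon^{\alpha^*}$ for all $\epsilon \le c_0' W_r(G,G')$.

Recall that $\hat{Q}_n$ is a point estimate of $Q$ defined earlier in this section. By the definition of variational distance, for any $\delta > 0$

\[ V(P_{Y_{[n]}|G}, P_{Y_{[n]}|G'}) \ge P(\hat{Q}_n \in B_\delta \mid G') - P(\hat{Q}_n \in B_\delta \mid G). \]

Here, $P(\cdot \mid G)$ is taken to mean the probability of an event given that the observations are generated according to the Dirichlet base measure $G$. Set $B_\delta := \{Q \in \mathcal{P}(\Theta) \mid \text{there is } Q' \in B \text{ such that } W_r(Q,Q') \le \delta\}$. We have

\[
\begin{aligned}
P(\hat{Q}_n \in B_\delta \mid G') &\ge P(\hat{Q}_n \in B_\delta,\ W_r(\hat{Q}_n,Q) \le \delta \mid G') \\
&\ge P(Q \in B,\ W_r(\hat{Q}_n,Q) \le \delta \mid G') \\
&\ge \mathcal{D}'(B) - P(W_r(\hat{Q}_n,Q) > \delta \mid G').
\end{aligned}
\]

We also have

\[
\begin{aligned}
P(\hat{Q}_n \in B_\delta \mid G) &\le P(\hat{Q}_n \in B_\delta,\ W_r(\hat{Q}_n,Q) \le \delta \mid G) + P(W_r(\hat{Q}_n,Q) > \delta \mid G) \\
&\le P(Q \in B_{2\delta} \mid G) + P(W_r(\hat{Q}_n,Q) > \delta \mid G) \\
&= \mathcal{D}(B_{2\delta}) + P(W_r(\hat{Q}_n,Q) > \delta \mid G).
\end{aligned}
\]

Hence,

\[
\begin{aligned}
V(P_{Y_{[n]}|G}, P_{Y_{[n]}|G'}) &\ge \mathcal{D}'(B) - \mathcal{D}(B_{2\delta}) - 2\sup_{Q \in \mathcal{Q}} P(W_r(\hat{Q}_n,Q) > \delta) \\
&\ge (\mathcal{D}'(B) - \mathcal{D}(B)) - \mathcal{D}(B_{2\delta} \setminus B) - 2\sup_{Q \in \mathcal{Q}} P(W_r(\hat{Q}_n,Q) > \delta).
\end{aligned}
\]

Since $r \in [1,2]$, $W_r(\hat{Q}_n,Q) \le W_2(\hat{Q}_n,Q)$. Choose $\delta := \delta_n$ as specified in Lemma 5. Then, for $n$ large so that $2\delta_n \le c_0' W_r(G,G')$, we have

\[ V(P_{Y_{[n]}|G}, P_{Y_{[n]}|G'}) \ge c_0 W_r^r(G,G') - C_0(2\delta_n)^{\alpha^*} - 10\exp(-c_2 n\epsilon_n^2). \]

□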



6 Proof of main theorems 
6.1 Proof of Theorem 1

Given $G \in \mathcal{P}(\Theta)$, define the Wasserstein ball centered at $G$ as: $B_{W_r}(G,\delta) := \{G' \in \mathcal{P}(\Theta) : W_r(G,G') \le \delta\}$. A useful quantity for proving posterior concentration theorems is the Hellinger information of the Wasserstein metric for a given set (see also Nguyen).

Definition 2. Fix $G_0 \in \mathcal{P}(\Theta)$. Let $\mathcal{G}$ be a subset of $\mathcal{P}(\Theta)$. For a fixed $n$, the sample size of $Y_{[n]}$, define the Hellinger information of the $W_r$ metric for the set $\mathcal{G}$ as a real-valued function on the positive reals $\Psi_{\mathcal{G},n} : \mathbb{R}_+ \to \mathbb{R}$:

\[ \Psi_{\mathcal{G},n}(\delta) := \inf_{G \in \mathcal{G};\ W_r(G_0,G) \ge \delta/2} h^2(p_{Y_{[n]}|G_0}, p_{Y_{[n]}|G}). \]

Recall that $p_{Y_{[n]}|G}$ is the marginal density of the $n$-vector $Y_{[n]}$ obtained by integrating out the generic random measure $Q$ via Eq. (3). We also define $\Phi_{\mathcal{G},n} : \mathbb{R}_+ \to \mathbb{R}$ to be an arbitrary non-negative valued function such that for any $\delta > 0$,

\[ \sup_{G,G' \in \mathcal{G};\ W_r(G,G') \le \Phi_{\mathcal{G},n}(\delta)} h^2(p_{Y_{[n]}|G}, p_{Y_{[n]}|G'}) \le \Psi_{\mathcal{G},n}(\delta)/4. \]

In both definitions of $\Psi$ and $\Phi$, we suppress the dependence on (the fixed) $G_0$ and the metric $W_r$ to simplify notation. Note that if $G_0 \in \mathcal{G}$, it follows from the definition that $\Phi_{\mathcal{G},n}(\delta) < \delta/2$.

Now we state an abstract convergence theorem for hierarchical models for grouped data. The general model is defined as follows:

\[ G \sim \Pi_G, \qquad Q_1, \ldots, Q_m \mid G \sim \mathcal{D}_G, \]

\[ Y^i_{[n]} \mid Q_i \sim Q_i * f \quad \text{for } i = 1, \ldots, m, \]

where $\mathcal{D}_G \in \mathcal{P}^{(2)}(\Theta)$ is parameterized by $G$.
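
As an illustration of the lower two levels of this model, the sketch below simulates grouped data when the base measure $G$ has finitely many atoms, in which case a draw $Q_i \sim \mathcal{D}_{\alpha G}$ reduces to a Dirichlet weight vector on those atoms. The top-level prior $\Pi_G$ is not sampled ($G$ is held fixed), the kernel $f$ is assumed Gaussian, and all names and numerical values are illustrative.

```python
import random

def sample_dirichlet(params, rng):
    """Draw from Dir(params) by normalizing independent Gamma(params[j], 1) variables."""
    draws = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(draws)
    return [x / total for x in draws]

def sample_groups(atoms, beta, alpha, m, n, sigma=1.0, seed=0):
    """Simulate m groups of n observations given G = sum_j beta[j] * delta_{atoms[j]}.

    For each group: q ~ Dir(alpha * beta) gives Q_i = sum_j q[j] * delta_{atoms[j]},
    then Y | Q_i is drawn from Q_i * f with f an assumed Gaussian(0, sigma^2) kernel.
    """
    rng = random.Random(seed)
    groups = []
    for _ in range(m):
        q = sample_dirichlet([alpha * b for b in beta], rng)
        locations = rng.choices(atoms, weights=q, k=n)   # latent atoms drawn from Q_i
        y = [rng.gauss(theta, sigma) for theta in locations]
        groups.append((q, y))
    return groups

groups = sample_groups(atoms=[-2.0, 0.0, 3.0], beta=[0.5, 0.3, 0.2], alpha=1.0, m=4, n=5)
```

With $\alpha \le 1$ the Dirichlet weights tend to concentrate on few atoms, so individual groups may look quite different from one another even though they share the same base measure.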
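Theorem 6. Suppose that

(a) $m \to \infty$ and $n \to \infty$ at a certain rate relative to $m$,

(b) There is a sequence of subsets $\mathcal{G}_m \subset \mathcal{P}(\Theta)$, a large constant $C$, a sequence of scalars $\epsilon_{m,n} \to 0$ defined in terms of $m$ and $n$, with $m\epsilon_{m,n}^2 \to \infty$, such that for all $\epsilon \ge \epsilon_{m,n}$,

\[
\begin{aligned}
&\sup_{G_1 \in \mathcal{G}_m} \log D(\Phi_{\mathcal{G}_m,n}(\epsilon),\ \mathcal{G}_m \cap B_{W_r}(G_1,\epsilon/2),\ W_r) \\
&\quad + \log D(\epsilon/2,\ \mathcal{G}_m \cap B_{W_r}(G_0,2\epsilon) \setminus B_{W_r}(G_0,\epsilon),\ W_r) \le m\epsilon_{m,n}^2,
\end{aligned} \tag{12}
\]

\[ \Pi_G(\mathcal{P}(\Theta) \setminus \mathcal{G}_m) \le \exp[-m\epsilon_{m,n}^2(C+4)], \tag{13} \]

\[ \Pi_G(B_K(G_0,\epsilon_{m,n})) \ge \exp[-m\epsilon_{m,n}^2 C]. \tag{14} \]

(c) There is a sequence of positive scalars $M_m$ such that

\[ \Psi_{\mathcal{G}_m,n}(M_m\epsilon_{m,n}) \ge 8\epsilon_{m,n}^2(C+4), \tag{15} \]

\[ \exp(2m\epsilon_{m,n}^2) \sum_{j \ge M_m} \exp[-m\Psi_{\mathcal{G}_m,n}(j\epsilon_{m,n})/8] \to 0. \tag{16} \]

Then, $\Pi_G(G : W_r(G_0,G) \ge M_m\epsilon_{m,n} \mid Y^{[m]}_{[n]}) \to 0$ in $P^m_{Y_{[n]}|G_0}$-probability as $m$ and $n \to \infty$.

The proof of this theorem is similar to that of Theorem 4 of Nguyen [2012b] and is omitted. We shall now apply Theorem 6 to derive the concentration rate in $W_1$ of the posterior of $G$ that arises from the hierarchical Dirichlet process. The calculations of rates are organized as a sequence of steps.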

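Step 1 Choose $\mathcal{G}_m := \mathcal{P}(\Theta)$, so Eq. (13) trivially holds. Since $\Theta$ is bounded, if Assumption (A1) holds for some $r \ge 1$ then it also holds for $r = 1$. By Lemma 2(a), $h^2(p_{Y_{[n]}|G}, p_{Y_{[n]}|G'}) \le nC_1 W_1(G,G')$. By the definition of $\Phi_{\mathcal{G},n}(\delta)$ for $W_1$, it suffices to set $\Phi_{\mathcal{G},n}(\delta) := \Psi_{\mathcal{G},n}(\delta)/(4nC_1)$.

To verify the condition expressed by Eq. (12), let us first derive a lower bound on $\Phi_{\mathcal{G},n}$, by appealing to Theorem 5:

\[
\begin{aligned}
\Phi_{\mathcal{G},n}(\omega) = \Psi_{\mathcal{G},n}(\omega)/(4nC_1) &= \inf_{G : W_1(G_0,G) \ge \omega/2} h^2(p_{Y_{[n]}|G_0}, p_{Y_{[n]}|G})/(4nC_1) \\
&\ge \big(c_0\omega/2 - C_0(2\delta_n)^{\alpha^*} - 10\exp(-c_2 n\epsilon_n^2)\big)^2/(4nC_1),
\end{aligned}
\]

where the last inequality in the previous display holds whenever $2\delta_n \le c_0'\omega$. The rates $\epsilon_n$ and $\delta_n$ are defined in Lemma 5 and the constants $c_0, c_0', C_0, c_2$ depend only on the fixed $G_0$. We shall require $n$ to grow at a certain rate so that for any $\omega \ge \epsilon_{m,n}$,

\[ C_0(2\delta_n)^{\alpha^*} + 10\exp(-c_2 n\epsilon_n^2) \le c_0\epsilon_{m,n}/4 \le c_0\omega/4. \tag{17} \]

This yields $\Phi_{\mathcal{G},n}(\omega) \ge 2c\omega^2/n$ for some constant $c > 0$ whenever $\omega \ge \epsilon_{m,n}$. As a result,

\[
\begin{aligned}
\sup_{G_1} \log D(\Phi_{\mathcal{G}_m,n}(\omega),\ \mathcal{G}_m \cap B_{W_1}(G_1,\omega/2),\ W_1) &\le \log N(c\omega^2/n,\ \mathcal{P}(\Theta),\ W_1) \\
&\le (n\,\mathrm{diam}(\Theta)/c\omega^2)^d \log(e + en\,\mathrm{diam}(\Theta)/c\omega^2) \le m\epsilon_{m,n}^2/2,
\end{aligned}
\]

where the second inequality in the previous display is an application of Lemma 4(b) of Nguyen [2012a], while the last inequality holds if $\epsilon_{m,n}$ satisfies:

\[ \epsilon_{m,n} \ge [n^d\log(mn)/m]^{1/(2d+2)}. \]

With this condition on $\epsilon_{m,n}$, for any $\omega \ge \epsilon_{m,n}$,

\[
\begin{aligned}
\log D(\omega/2,\ \mathcal{P}(\Theta),\ W_1) &\le \log N(\omega/4,\ \mathcal{P}(\Theta),\ W_1) \\
&\le N(\omega/8,\ \Theta,\ \|\cdot\|)\log(e + 8e\,\mathrm{diam}(\Theta)/\omega) \\
&\le (8\,\mathrm{diam}(\Theta)/\omega)^d\log(e + 8e\,\mathrm{diam}(\Theta)/\omega) \le m\epsilon_{m,n}^2/2.
\end{aligned}
\]

Hence, Eq. (12) is verified.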
Step 2 By Theorem 3,

\[
\begin{aligned}
-\log \Pi_G(B_K(G_0,\epsilon_{m,n})) &\le -D\log\gamma + [(1+d/r)(D-1) + Dd/r]\log(n^3/\epsilon_{m,n}^2) + \log(1/c) \\
&\le C(n^3/\epsilon_{m,n}^2)^{d/r}\log(n^3/\epsilon_{m,n}^2) \le Cm\epsilon_{m,n}^2
\end{aligned}
\]

for some constant $C > 0$, where the last inequality holds if

\[ \epsilon_{m,n} \ge [n^{3d}\log(mn)/m]^{1/(2d+2)} \ge [n^{3d/r}\log(nm)/m]^{r/(2d+2r)}. \]

Since $\Psi_{\mathcal{G},n}(\omega) \ge c\omega^2$ for some $c > 0$, it is simple to check that both Eqs. (15) and (16) will follow by setting $M_m$ to be a sufficiently large constant.

Step 3 It remains to clarify the condition on $n$ as specified by Eq. (17). We consider two cases. First, if $f$ is an ordinary smooth kernel density, then by Lemma 5(c), we have $\delta_n \asymp \epsilon_n^{2/(4+(2\beta+1)d')}$ for arbitrary $d' > d$. By Lemma 5(b) we obtain $\epsilon_n \asymp n^{-1/2}\log n$ because the number of support points for the true base measure $G_0$ (and subsequently the $Q_i$'s) is bounded by a constant. To satisfy Eq. (17), $n$ has to grow with $m$ so that

\[ ((\log n)^2/n)^{\alpha^*/(4+(2\beta+1)d')} \le O(1) \times [n^{3d}\log(mn)/m]^{1/(2d+2)}. \]

A sufficient condition is $n^{3d/(2d+2) + \alpha^*/(4+(2\beta+1)d')} \gtrsim m^{1/(2d+2)}$. Note that $n$ should also grow sufficiently slowly so that $\epsilon_{m,n} \to 0$. In summary, the posterior concentration rate $\epsilon_{m,n} \asymp (n^{3d}\log m/m)^{1/(2d+2)}$ is obtained if $n$ tends to infinity at the rate

\[ m^{\frac{4+(2\beta+1)d'}{3d(4+(2\beta+1)d') + (2d+2)\alpha^*}} \lesssim n \lesssim (m/\log m)^{1/(3d)}. \]

Step 3' Consider the case where $f$ is a supersmooth kernel density. Then from Lemma 5(b) and (d) we have $\epsilon_n \asymp n^{-1/2}\log n$ and $\delta_n \asymp [-\log\epsilon_n]^{-1/\beta} \asymp [\log n]^{-1/\beta}$. To satisfy Eq. (17) we need

\[ (\log n)^{-\alpha^*/\beta} \lesssim O(1) \times [n^{3d}\log(mn)/m]^{1/(2d+2)}. \]

In this case our asymptotic theory is applicable for the following range of $n$ relative to $m$:

\[ \frac{m}{\log m\,(\log n)^{\alpha^*(2d+2)/\beta}} \lesssim n^{3d} \lesssim \frac{m}{\log m}. \]
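
For the ordinary smooth case of Step 3, the admissible growth of $n$ is a window of powers of $m$ (ignoring logarithmic factors). The helper below, a sketch with a hypothetical name and illustrative inputs, computes the two exponents and confirms that the window is never empty when $\alpha^* > 0$.

```python
def n_exponent_range(d, beta, d_prime, alpha_star):
    """Exponents (lower, upper) such that m**lower <~ n <~ m**upper, up to log factors,
    in the ordinary smooth case of Step 3."""
    k = 4.0 + (2.0 * beta + 1.0) * d_prime          # the quantity 4 + (2*beta + 1) * d'
    lower = k / (3.0 * d * k + (2.0 * d + 2.0) * alpha_star)
    upper = 1.0 / (3.0 * d)
    return lower, upper

lower, upper = n_exponent_range(d=1, beta=1.0, d_prime=1.5, alpha_star=0.5)
```

Since $\alpha^* > 0$, the lower exponent is strictly below $1/(3d)$; the gap narrows as $\alpha^*$ decreases or as the smoothness $\beta$ grows.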

6.2 Proof of Theorem 2

Recall that for each $h$, $\epsilon_{m,n} = \epsilon_{m,n}(h)$ is a net of scalars indexed by $m, n$ that tend to 0. Define $A_{m,n} := \{G : W_1(G,G_0) \ge \epsilon_{m,n}\}$ and $B_{m,n} := \{Q : h(Q * f, Q_0^* * f) \ge C((\log n/n)^{1/(d+2)} + \epsilon_{m,n}^{r_0/2}\log(1/\epsilon_{m,n}))\}$ for some large constant $C$. Due to conditional independence,

\[
\begin{aligned}
\Pi_Q(Q \in B_{m,n} \mid Y^0_{[n]}, Y^{[m]}_{[n]}) &= \int \Pi_Q(Q \in B_{m,n} \mid G, Y^0_{[n]})\, d\Pi_G(G \mid Y^0_{[n]}, Y^{[m]}_{[n]}) \\
&\le \int_{A_{m,n}^c} \Pi_Q(Q \in B_{m,n} \mid G, Y^0_{[n]})\, d\Pi_G(G \mid Y^0_{[n]}, Y^{[m]}_{[n]}) \\
&\quad + \Pi_G(G \in A_{m,n} \mid Y^0_{[n]}, Y^{[m]}_{[n]}).
\end{aligned}
\]

For each $h$, the second quantity in the upper bound tends to 0 in $P_{Y^0_{[n]}|Q_0^*} \times P^m_{Y_{[n]}|G_0}$-probability, as $m, n \to \infty$ at suitable rates by condition (d) of the theorem. Now, as $n \to \infty$, the first quantity tends to 0 as a consequence of Theorem 7. This completes the proof for (i). Parts (ii) and (iii) are proved in the same way.

7 Posterior concentration under perturbation of base measure 

The main purpose of this section is to prove a posterior concentration theorem for the mixture distribution $Q * f$, conditioning on the event that the base measure $G$ is a small perturbation of the true base measure $G_0$. This result (Theorem 7) supplies the key part in the proof of Theorem 2. It also highlights the benefit of borrowing strength in hierarchical models. We shall prepare with three technical lemmas, whose proofs are given in Section 8. The first lemma demonstrates gains in the thickness of the conditional Dirichlet prior (given a perturbed base measure) compared to the unconditional Dirichlet prior.

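Lemma 6. Let $G_0 = \sum_{i=1}^k \beta_i\delta_{\theta_i}$ and let $\epsilon > 0$ be small. Let $G \in \mathcal{P}(\Theta)$ be such that $W_1(G,G_0) \le \epsilon$. Suppose that $\mathrm{law}(Q) = \mathcal{D}_{\alpha G}$, where $\alpha \in (0,1]$.

(a) For any $Q_0 \in \mathcal{P}(\Theta)$ such that $\mathrm{spt}\,Q_0 \subset \mathrm{spt}\,G_0$, and any $\delta$ such that $\delta \ge \max_{i \le k} 2\epsilon/\beta_i$ and $\delta \le \min_{i,j \le k}\|\theta_i - \theta_j\|/2$, and any $r \ge 1$, there holds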
W(Q»,Q) < 2^) > r ( aO(a./2)'( SI i- (§y )' ,+ " 1 n ft . 

(b) In addition, suppose that (A1-A2) hold for some $r = r_0 \ge 1$. Then, there are constants $C, c > 0$ depending only on $\alpha$, $k$, $C_1$, $M$, $\mathrm{diam}(\Theta)$, $r_0$ and [...] such that for any $\delta$ such that $\delta/\log(1/\delta) \ge C\epsilon^{r_0/2}$,

\[ P(Q \in B_K(Q_0,\delta)) \ge c(\delta/\log(1/\delta))^{2(\alpha+k-1)}. \]

This should be contrasted with the general small ball probability bound of the Dirichlet process as stated by Lemma 3. In that lemma the base measure is an arbitrary non-atomic measure, while the lower bound is applied to any small $W_r$ ball centered at an arbitrary measure. The lower bound is exponentially small in the radius. In the present lemma, the base measure $G$ is constrained to being close to a discrete measure $G_0$ with $k < \infty$ support points, while the lower bound is applied to small $W_r$ balls centered at $Q_0$ that shares the same support as $G_0$. As a result, the lower bound is only polynomially small in the radius.

The following lemma relies on the intuition that the Dirichlet measure concentrates most of its mass on probability measures which place most of their mass on a "small" number of support points.

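Lemma 7. Let $\mathcal{D} := \mathcal{D}_{\alpha G}$ and $r \ge 1$. For any $\delta > 0$, and for any $k \in \mathbb{N}_+$, there is a measurable set $B_k \subset \mathcal{P}(\Theta)$ that satisfies the following properties:

(a) $\sup_{Q \in B_k} \inf_{Q' \in \mathcal{Q}_k} W_r(Q,Q') \le \delta$.

(b) $\log N(\delta, B_k, W_r) \le k(\log N(\delta/4, \Theta, \|\cdot\|) + \log(e + 4e\,\mathrm{diam}(\Theta)^r/\delta^r))$.

(c) There holds

\[ \mathcal{D}(\mathcal{P}(\Theta) \setminus B_k) \le k^{-k}(\delta/\mathrm{diam}(\Theta))^{\alpha r}[e\alpha r\log(\mathrm{diam}(\Theta)/\delta)]^k. \]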

To see that the set $B_k$ has small entropy relative to $\mathcal{P}(\Theta)$, we note a general estimate for $\mathcal{P}(\Theta)$, which gives an upper bound that is exponentially large in terms of the entropy of $\Theta$ (cf. Lemma 4 of Nguyen [2012a]):

\[ \log N(\delta, \mathcal{P}(\Theta), W_r) \le N(\delta/2, \Theta, \|\cdot\|)\log(e + 2e\,\mathrm{diam}(\Theta)^r/\delta^r). \]

In Lemma 7 the bound on the entropy of $B_k$ increases only linearly in the entropy of $\Theta$. However, it also increases with $k$, which controls the measure of the complement of $B_k$. Next, we consider the additional assumption that the Dirichlet base measure is a small perturbation of a discrete measure with $k$ support points. The strength of this result compared to the previous lemma is that the entropy estimate depends only linearly on the entropy of $\Theta$, while $k$ is fixed. The measure of the complement set of $B$ is controlled only by the amount of perturbation.

Lemma 8. Let $\epsilon > 0$, $k < \infty$, $r \ge 1$. Let $G_0, G \in \mathcal{P}(\Theta)$ be such that $G_0$ has $k$ support points and $W_1(G,G_0) \le \epsilon$. Let $\mathcal{D} := \mathcal{D}_{\alpha G}$ for some $\alpha > 0$. For any $\delta > 0$, there is a measurable set $B \subset \mathcal{P}(\Theta)$ that satisfies the following:

(a) $\log N(\delta, B, W_r) \le k(\log N(\delta/4, \Theta, \|\cdot\|) + \log(e + 4e\,\mathrm{diam}(\Theta)^r/\delta^r))$.

(b) $\mathcal{D}(\mathcal{P}(\Theta) \setminus B) \le \epsilon\,\mathrm{diam}(\Theta)^{r-1}/\delta^r$.



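Theorem 7. Let $\Theta$ be a bounded subset of $\mathbb{R}^d$. Suppose that Assumptions (A1-A2) hold. Let $Q_0 \in \mathcal{P}(\Theta)$ be such that $\mathrm{spt}\,Q_0 \subset \mathrm{spt}\,G_0$, where $G_0 = \sum_{i=1}^k \beta_i\delta_{\theta_i}$ for some $k < \infty$. Let $\Pi_G$ be an arbitrary prior distribution on $\mathcal{P}(\Theta)$. Consider the following hierarchical model

\[ G \sim \Pi_G, \qquad Q \mid G \sim \Pi_Q := \mathcal{D}_{\alpha G}, \]

\[ Y_{[n]} = (Y_1, \ldots, Y_n) \mid Q \sim Q * f. \]

Let $\epsilon_n \downarrow 0$ and define events $\mathcal{E}_n := \{W_1(G,G_0) \le \epsilon_n\}$. Then, the posterior distribution of $Q$ given $Y_{[n]}$ admits the following as $n \to \infty$:

\[ \Pi_Q\big(h(Q * f, Q_0 * f) \ge \delta_n \,\big|\, Y_{[n]}, \mathcal{E}_n\big) \to 0, \tag{18} \]

\[ \Pi_Q\big(W_2(Q,Q_0) \ge M_n\delta_n \,\big|\, Y_{[n]}, \mathcal{E}_n\big) \to 0 \tag{19} \]

in $(Q_0 * f) \times \Pi_G$-probability, where the rates $\delta_n$ and $M_n\delta_n$ are given as follows:

(i) $\delta_n \asymp (\log n/n)^{1/(d+2)} + \epsilon_n^{r_0/2}\log(1/\epsilon_n)$.

(ii) If $f$ is ordinary smooth with smoothness $\beta > 0$, $M_n\delta_n \asymp \delta_n^{2/(4+(2\beta+1)d')}$ for any $d' > d$.

(iii) If $f$ is supersmooth with smoothness $\beta > 0$, $M_n\delta_n \asymp (-\log\delta_n)^{-1/\beta}$.

If $\epsilon_n \downarrow 0$ suitably fast, then the following rates for $\delta_n$ are valid:

(iv) If $f$ is ordinary smooth, and $\epsilon_n \to 0$ sufficiently fast such that $\epsilon_n \lesssim \min\{(n\log n)^{-1/r_0},\ n^{-(\alpha+k+4M_0)}(\log n)^{-(\alpha+k-2)}\}$, where $M_0$ is some large constant, then $\delta_n \asymp (\log n/n)^{1/2}$.

(v) If $f$ is supersmooth with smoothness $\beta > 0$, and $\epsilon_n \to 0$ sufficiently fast such that $\epsilon_n \lesssim n^{-2(\alpha+k)/(\beta+2)}(\log n)^{-2(\alpha+k-1)}\exp(-4n^{\beta/(\beta+2)})$, then $\delta_n \asymp (1/n)^{1/(\beta+2)}$.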

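Proof. The proof is organized in the following steps.

Step 1 Let $p_{Y|Q}$ denote the mixture density $Q * f$. Define the Hellinger information of the $W_2$ metric for a set $\mathcal{Q} \subset \mathcal{P}(\Theta)$ given $Q_0$:

\[ \psi_{\mathcal{Q}}(\delta) = \inf_{Q \in \mathcal{Q};\ W_2(Q_0,Q) \ge \delta/2} h^2(p_{Y|Q_0}, p_{Y|Q}). \]

Also define $\phi_{\mathcal{Q}} : \mathbb{R}_+ \to \mathbb{R}$ to be an arbitrary non-negative valued function such that for any $\delta > 0$, $\sup_{Q,Q' \in \mathcal{Q};\ W_2(Q,Q') \le \phi_{\mathcal{Q}}(\delta)} h^2(p_{Y|Q}, p_{Y|Q'}) \le \psi_{\mathcal{Q}}(\delta)/4$. Define

\[ B_K(Q_0,\delta) = \{Q \in \mathcal{P}(\Theta) \mid K(p_{Y|Q_0}, p_{Y|Q}) \le \delta^2,\ K_2(p_{Y|Q_0}, p_{Y|Q}) \le \delta^2\}. \tag{20} \]

The first step of the proof involves obtaining the following result: Suppose that there is a sequence $\delta_n \to 0$ such that $n\delta_n^2 \to \infty$, a sequence of scalars $M_n$, a sequence of subsets of measures $\mathcal{S}_n \subset \mathcal{P}(\Theta)$ and the following hold:

\[
\begin{aligned}
&\log D(\delta/2,\ \mathcal{S}_n \cap B_{W_2}(Q_0,2\delta) \setminus B_{W_2}(Q_0,\delta),\ W_2) \\
&\quad + \sup_{G_1 \in \mathcal{S}_n} \log D(\phi_{\mathcal{S}_n}(\delta),\ \mathcal{S}_n \cap B_{W_2}(G_1,\delta/2),\ W_2) \le n\delta_n^2 \quad \forall\, \delta \ge \delta_n, \text{ a.s.}
\end{aligned} \tag{21}
\]

\[ \frac{\Pi(\mathcal{P}(\Theta) \setminus \mathcal{S}_n \mid \mathcal{E}_n)}{\Pi(B_K(Q_0,\delta_n) \mid \mathcal{E}_n)} = o(\exp(-2n\delta_n^2)), \text{ a.s.} \tag{22} \]

\[ \frac{\Pi(\mathcal{S}_n \cap B_{W_2}(Q_0,2j\delta_n) \setminus B_{W_2}(Q_0,j\delta_n) \mid \mathcal{E}_n)}{\Pi(B_K(Q_0,\delta_n) \mid \mathcal{E}_n)} \le \exp[n\psi_{\mathcal{S}_n}(j\delta_n)/16] \quad \text{for all } j \ge M_n, \text{ a.s.} \tag{23} \]

\[ \exp(2n\delta_n^2) \sum_{j \ge M_n} \exp[-n\psi_{\mathcal{S}_n}(j\delta_n)/16] \to 0, \text{ a.s.} \tag{24} \]

Here, the almost sure statements are taken with respect to $\Pi_G$. Then, both Eq. (18) and Eq. (19) hold. The proof of this step is a straightforward modification of the proof of Theorem 4 in Nguyen [2012a] and is omitted.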



Step 2 By assumption (A2) and Lemma 2, $h^2(p_{Y|Q}, p_{Y|Q'}) \le K(p_{Y|Q}, p_{Y|Q'}) \le C_1 W_r^r(Q,Q')$ for $r = r_0$. Since

\[ W_r^r(Q,Q') \le \mathrm{diam}(\Theta)^{r-1}W_1(Q,Q') \le \mathrm{diam}(\Theta)^{r-1}W_2(Q,Q'), \]

a valid choice for $\phi_{\mathcal{S}}$ is $\phi_{\mathcal{S}}(\delta) = \psi_{\mathcal{S}}(\delta)/(4C_1\mathrm{diam}(\Theta)^{r-1})$. Recall the following facts (Proposition 1 of Nguyen [2012a]): If $f$ is an ordinary smooth density function on $\mathbb{R}^d$ with parameter $\beta > 0$, then for any $d' > d$, $\psi_{\mathcal{P}(\Theta)}(\delta) \ge c(d,\beta)\delta^{4+(2\beta+1)d'}$ for some constant $c(d,\beta) > 0$. If $f$ is a supersmooth density function on $\mathbb{R}^d$ with parameter $\beta > 0$, then $\psi_{\mathcal{P}(\Theta)}(\delta) \ge \exp[-c(d,\beta)\delta^{-\beta}]$ for some constant $c(d,\beta) > 0$. These give immediate lower bounds on $\psi_{\mathcal{S}}(\delta) \ge \psi_{\mathcal{P}(\Theta)}(\delta)$.
Note in the following that constants $c$ may be different in each appearance.

Step 3 Consider the ordinary smooth case. We shall construct a sequence of subsets $\mathcal{S}_n$ and a sequence $\delta_n \to 0$ such that both Eq. (21) and Eq. (22) hold. In fact, we set $\mathcal{S}_n := B_n$, whose existence and properties are given by Lemma 8. Note that the choice of $B_n$ is random because it depends on $G$, in addition to $\delta_n$. For any $\delta \ge \delta_n$, $\Pi_G$-almost surely we have

\[
\begin{aligned}
\log D(\delta/2,\ \mathcal{S}_n \cap B_{W_2}(Q_0,2\delta) \setminus B_{W_2}(Q_0,\delta),\ W_2) &\le \log N(\delta/4, B_n, W_2) \\
&\le k[\log N(\delta_n/16, \Theta, \|\cdot\|) + \log(e + 16e\,\mathrm{diam}(\Theta)^2/\delta_n^2)] \\
&\lesssim kd\log(\mathrm{diam}(\Theta)/\delta_n).
\end{aligned} \tag{25}
\]

In addition, $\Pi_G$-almost surely, for any $\delta \ge \delta_n$,

\[
\begin{aligned}
\sup_{G_1 \in \mathcal{S}_n} \log D(\phi_{\mathcal{S}_n}(\delta),\ \mathcal{S}_n \cap B_{W_2}(G_1,\delta/2),\ W_2) &\le \log D(\phi_{\mathcal{S}_n}(\delta), \mathcal{S}_n, W_2) \le \log N(c\delta^{4+(2\beta+1)d'}, B_n, W_2) \\
&\le k[\log N(c\delta^{4+(2\beta+1)d'}/4, \Theta, \|\cdot\|) \\
&\qquad + \log(e + 4e\,\mathrm{diam}(\Theta)^2/c^2\delta^{8+2(2\beta+1)d'})] \\
&\lesssim kd^2\log(\mathrm{diam}(\Theta)/\delta_n).
\end{aligned} \tag{26}
\]

From Eqs. (25) and (26), by setting $\delta_n = M(\log n/n)^{1/2}$, where $M$ is a positive constant depending on $k$, $d$, $\mathrm{diam}(\Theta)$ and $\beta$, the entropy condition expressed by Eq. (21) immediately follows.

According to part (b) of Lemma 8, $\Pi(\mathcal{P}(\Theta) \setminus \mathcal{S}_n \mid \mathcal{E}_n) \le \epsilon_n\mathrm{diam}(\Theta)/\delta_n^2$. By Lemma 6, if $\delta_n$ satisfies $\delta_n/\log(1/\delta_n) \ge C\epsilon_n^{r_0/2}$, then $\Pi_G$-almost surely,

\[ \Pi(B_K(Q_0,\delta_n) \mid \mathcal{E}_n) \ge c(\delta_n/\log(1/\delta_n))^{2(\alpha+k-1)}. \]

It follows that if $\delta_n \gtrsim \epsilon_n^{r_0/2}\log(1/\epsilon_n)$, then

\[ \frac{\Pi(\mathcal{P}(\Theta) \setminus \mathcal{S}_n \mid \mathcal{E}_n)}{\Pi(B_K(Q_0,\delta_n) \mid \mathcal{E}_n)} \le \epsilon_n\delta_n^{-2(\alpha+k)}[\log(1/\delta_n)]^{2(\alpha+k-1)}\mathrm{diam}(\Theta)/c. \tag{27} \]

From Eq. (27), the condition expressed by Eq. (22) is verified if

\[ \delta_n \gtrsim \epsilon_n^{r_0/2}\log(1/\epsilon_n) \quad \text{and} \quad \delta_n^{2(\alpha+k)} \gtrsim \epsilon_n\exp(4n\delta_n^2)(\log(1/\delta_n))^{2(\alpha+k-1)}. \]

It is easily seen that both inequalities hold for the rate $\delta_n = M_0(\log n/n)^{1/2}$, if $\epsilon_n \lesssim \min\{(n\log n)^{-1/r_0},\ n^{-(\alpha+k+4M_0)}(\log n)^{-(\alpha+k-2)}\}$. In other words, if $\epsilon_n$ tends to 0 sufficiently fast, we obtain the parametric rate of convergence $(\log n/n)^{1/2}$ for the posterior of $Q * f$.

It remains to verify Eqs. (23) and (24). It suffices to construct a sequence $M_n$ so that $\psi_{\mathcal{S}_n}(M_n\delta_n) \ge C\delta_n^2$ for a large constant $C > 0$. Since $\psi_{\mathcal{S}_n}(\delta) \ge c\delta^{4+(2\beta+1)d'}$ for any $d' > d$, $M_n$ can be chosen so that $M_n\delta_n \gtrsim \delta_n^{2/(4+(2\beta+1)d')}$.

Step 4 We turn to the case of a supersmooth kernel density $f$. $\mathcal{S}_n = B_n$ is constructed in the same way as in Step 3. Bounds (25) and (27) remain valid, but Eq. (26) is replaced (using the bound on $\psi_{\mathcal{S}_n}$ described in Step 2) as follows: for any $\delta \ge \delta_n$,

\[
\begin{aligned}
\sup_{G_1 \in \mathcal{S}_n} \log D(\phi_{\mathcal{S}_n}(\delta),\ \mathcal{S}_n \cap B_{W_2}(G_1,\delta/2),\ W_2) &\le \log D(\phi_{\mathcal{S}_n}(\delta), \mathcal{S}_n, W_2) \le \log N(\exp[-c\delta^{-\beta}], B_n, W_2) \\
&\le k[\log N(\exp[-c\delta^{-\beta}]/4, \Theta, \|\cdot\|) \\
&\qquad + \log(e + 4e\,\mathrm{diam}(\Theta)^2\exp[c\delta^{-\beta}])] \lesssim kd\,\delta^{-\beta}.
\end{aligned} \tag{28}
\]

From Eqs. (25) and (28), by setting $\delta_n = M_0(1/n)^{1/(\beta+2)}$, where $M_0$ is a large constant depending on $k$, $d$, $\mathrm{diam}(\Theta)$, the entropy condition expressed by Eq. (21) immediately follows. As in the previous step, if $\epsilon_n$ tends to 0 sufficiently fast, i.e., $\epsilon_n \lesssim n^{-2(\alpha+k)/(\beta+2)}(\log n)^{-2(\alpha+k-1)}\exp(-4n^{\beta/(\beta+2)})$, then Eq. (22) is satisfied for the given choice of $\delta_n$. That is, we obtain a rate $(1/n)^{1/(\beta+2)}$ of posterior concentration of $Q * f$ that does not depend on $d$.



Step 5 The previous two steps established (iv) and (v), i.e., they derive the rates $\delta_n$ when $\epsilon_n$ tends to 0 sufficiently fast. In the final step we derive a $\delta_n$ that is valid for any $\epsilon_n$. Here, we turn to the choice $\mathcal{S}_n := \mathcal{P}(\Theta)$: this is in fact a convex set, so we can appeal to Theorem 3 of Nguyen [2012a], which avoids having to upper bound the covering number in terms of the $W_2$ radius $\phi_{\mathcal{S}_n}(\delta)$. (Indeed, a similar calculation has been carried out in their Theorem 6, which gives posterior concentration rates of mixing measures for the stand-alone mixture model $Q * f$, and which yields the rate $(\log n/n)^{1/(d+2)}$.) We omit the similar derivations here, which give $\delta_n \asymp (\log n/n)^{1/(d+2)} + \epsilon_n^{r_0/2}\log(1/\epsilon_n)$ as the posterior concentration rate of the mixture density $Q * f$. The extra quantity depending on $\epsilon_n$ is needed so that Lemma 6 can be applied. □

8 Appendix: Proof of technical results

8.1 Proof of Lemma 4

(a) A random measure $Q$ distributed by $\mathcal{D}$ can be represented as $Q = \sum_{i=1}^k q_i\delta_{\theta_i}$, where $q = (q_1, \ldots, q_k)$ is a Dirichlet vector: $\mathrm{law}(q) = \mathrm{Dir}(\alpha\beta_1, \ldots, \alpha\beta_k)$. Both probability measures $\mathcal{D}$ and $\mathcal{D}'$ are supported by $\theta_1, \ldots, \theta_k$; they are absolutely continuous with respect to each other and admit a Radon-Nikodym density ratio. Let $B$ be the set of measures at which the density value under $\mathcal{D}'$ is strictly greater than the density value under $\mathcal{D}$. That is, $B$ consists of those $Q = \sum_{i=1}^k q_i\delta_{\theta_i}$ at which the $\mathrm{Dir}(\alpha'\beta_1', \ldots, \alpha'\beta_k')$ density exceeds the $\mathrm{Dir}(\alpha\beta_1, \ldots, \alpha\beta_k)$ density, so that

\[ \mathrm{bd}\,B = \left\{ \sum_{i=1}^k q_i\delta_{\theta_i} \,\Big|\, \sum_{i=1}^k \Delta_i\log q_i = \log\frac{\Gamma(\alpha')}{\Gamma(\alpha)} + \sum_{i=1}^k \big[\log\Gamma(\alpha\beta_i) - \log\Gamma(\alpha'\beta_i')\big] \right\}. \tag{29} \]

Here, $\Delta_i := \alpha\beta_i - \alpha'\beta_i'$. We have

\[ V(\mathcal{D},\mathcal{D}') = \mathcal{D}'(B) - \mathcal{D}(B) = V(\mathrm{Dir}(\alpha\beta_1, \ldots, \alpha\beta_k), \mathrm{Dir}(\alpha'\beta_1', \ldots, \alpha'\beta_k')). \]
Since the Wasserstein distance is bounded by the weighted total variation (cf. Theorem 6.15 of Villani [2008]), we have:

\[ W_r^r(\mathcal{D},\mathcal{D}') \le 2^{r-1}\int W_r^r(G_0,G)\,|\mathcal{D} - \mathcal{D}'|(dG) \le 2^r\mathrm{diam}(\Theta)^r V(\mathcal{D},\mathcal{D}'). \tag{30} \]

Thus, $V(\mathcal{D},\mathcal{D}') \ge W_r^r(\mathcal{D},\mathcal{D}')/(2\,\mathrm{diam}(\Theta))^r = W_r^r(G,G')/(2\,\mathrm{diam}(\Theta))^r$, where the equality is due to Lemma 1. So, condition (i) in Definition 1 is satisfied.

By Lemma 9 (stated at the end of this proof), it remains to obtain an upper bound for $\mathcal{D}((\mathrm{bd}\,B)_\epsilon)$. Define $\mathcal{Q}$ to be the subset of vectors $q = (q_1, \ldots, q_k) \in \Delta^{k-1}$ which are constrained by the equation that defines $\mathrm{bd}\,B$. $\mathcal{Q}$ can be partitioned into disjoint subsets, according to the smallest and largest elements among $\Delta_i/q_i$ for $i = 1, \ldots, k$. There are $k(k-1)$ such subsets, some of which may be empty. Consider, for example, a subset, say $S \subset \mathcal{Q}$, for which $\Delta_1/q_1$ is the greatest and $\Delta_2/q_2$ is the smallest. Suppose that within the set $S$, $\Delta_1/q_1 > \Delta_2/q_2$ (strictly greater). By the implicit function theorem, $q_1, q_2$ can be written as differentiable functions of $q_3, \ldots, q_k$ such that for each $i = 3, \ldots, k$:

\[ \frac{\partial q_1}{\partial q_i} = -\frac{\Delta_i/q_i - \Delta_2/q_2}{\Delta_1/q_1 - \Delta_2/q_2}, \]

\[ \frac{\partial q_2}{\partial q_i} = \frac{\Delta_i/q_i - \Delta_1/q_1}{\Delta_1/q_1 - \Delta_2/q_2}. \]

Note that both $|\partial q_1/\partial q_i| \le 1$ and $|\partial q_2/\partial q_i| \le 1$ for all $q \in S$. It follows that the number of $\ell_\infty$ balls in $\Delta^{k-1}$ of radius $\epsilon$ needed to cover $S$ is less than $(1/\epsilon)^{k-2}$. If, however, $\Delta_1/q_1 = \Delta_i/q_i$ for all $i = 2, \ldots, k$, then the set $S$ is a singleton, which is trivially covered by a single ball. This implies that the $\epsilon$-covering number in the $\ell_\infty$ metric for $\mathcal{Q}$ is less than $k^2(1/\epsilon)^{k-2}$.

It is simple to see that for any $Q$ in

\[ (\mathrm{bd}\,B)_\epsilon \cap \mathrm{spt}\,\mathcal{D} = \left\{ Q = \sum_{i=1}^k q_i\delta_{\theta_i} \,\Big|\, W_r(Q,Q') \le \epsilon \text{ for some } Q' \in \mathrm{bd}\,B \right\} \]

there must be $Q' = \sum_{i=1}^k q_i'\delta_{\theta_i} \in \mathrm{bd}\,B$ such that the distance $\|q - q'\|_\infty \le \epsilon^r/m^r$, where $m = \min_{i \ne j \le k}\|\theta_i - \theta_j\|$. So $(\mathrm{bd}\,B)_\epsilon \cap \mathrm{spt}\,\mathcal{D}$ can be covered in the $W_r$ metric by probability measures whose corresponding mixing probability vectors $q$ form a $\delta$-covering for $\mathcal{Q}$, where $\delta := \epsilon^r/m^r$.

Given that $\alpha\beta_i \ge 1$ for all $i = 1, \ldots, k$, under $\mathrm{Dir}(\alpha\beta_1, \ldots, \alpha\beta_k)$ the measure of a $(k-1)$-dimensional $\delta$-ball in the $\ell_\infty$ metric is bounded from above by $C\delta^{k-1}$, where $C$ is a universal constant. As a result, $\mathcal{D}((\mathrm{bd}\,B)_\epsilon) \le k^2(1/\delta)^{k-2} \times C\delta^{k-1} = C_0\epsilon^r$ for some $C_0 > 0$ independent of $G'$.

(b) This part requires a more careful estimate of the upper bound for the $\mathcal{D}$ measure on $(\mathrm{bd}\,\mathcal{B})_\epsilon$. Let $\mathcal{Q}^*$ be a $\delta$-covering of $\mathcal{Q}$ in $\ell_\infty$ metric, where as in part (a), $\delta = \epsilon^r/m^r$. Consider an element $q^* \in \mathcal{Q}^*$. There is at least one index, say $j^* \in [1, k]$, such that $q^*_{j^*} \ge 1/k$. Without loss of generality, assume that $j^* = k$. Then for $\delta < 1/2k$, $|q_k - q_k^*| \le \delta$ means that $q_k \ge 1/k - \delta > 1/2k$. Under $\mathrm{Dir}(\alpha\beta_1, \ldots, \alpha\beta_k)$, the measure of the ball of radius $\delta$ centered at $q^*$ is:

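\[
\begin{aligned}
P(\|q - q^*\|_\infty \le \delta)
&= \frac{\Gamma(\alpha)}{\prod_{i=1}^k \Gamma(\alpha\beta_i)} \int_{\|q - q^*\|_\infty \le \delta} \prod_{i=1}^{k-1} q_i^{\alpha\beta_i - 1} \Big(1 - \sum_{i=1}^{k-1} q_i\Big)^{\alpha\beta_k - 1} dq_1 \cdots dq_{k-1} \\
&\le \frac{(2k)^{1-\alpha^*}\,\Gamma(\alpha)}{\prod_{i=1}^k \Gamma(\alpha\beta_i)} \prod_{i=1}^{k-1} \frac{(q_i^* + \delta)_{++}^{\alpha\beta_i} - (q_i^* - \delta)_{+}^{\alpha\beta_i}}{\alpha\beta_i}.
\end{aligned}
\]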

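Here, $(a)_+ := a \vee 0$, and $(a)_{++} := a \wedge 1$. The inequality in the above display is due to $\alpha^* \le \alpha\beta_i \le 1$. It is simple to verify that

\[ \big[(q_i^* + \delta)_{++}^{\alpha\beta_i} - (q_i^* - \delta)_{+}^{\alpha\beta_i}\big]/(\alpha\beta_i) \le A(q_i^*, \delta) \]

if we set $A(q_i^*, \delta) := (3\delta)^{\alpha\beta_i}/\alpha^*$ if $q_i^* \le 2\delta$, and $A(q_i^*, \delta) := 2\delta[(t-1)\delta]^{\alpha\beta_i - 1} = 2(t-1)^{\alpha\beta_i - 1}\delta^{\alpha\beta_i}$ if $q_i^* \in (t\delta, (t+1)\delta]$ for some natural number $t \ge 2$. So, we obtain an upper bound for the quantity in the previous display:

\[ P(\|q - q^*\|_\infty \le \delta) \le C(\alpha, \beta, k) \prod_{i \ne j^*} A(q_i^*, \delta), \]

where $C(\alpha, \beta, k) = (2k)^{1-\alpha^*}\,\Gamma(\alpha)/\prod_{i=1}^k \Gamma(\alpha\beta_i)$.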

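To show that $\mathcal{D}((\mathrm{bd}\,\mathcal{B})_\epsilon) = O(\delta^{\alpha^*})$, we only need to show that for a certain $\delta$-covering set for $\mathcal{Q}$, say $\mathcal{Q}^*$,

\[ \sum_{q^* \in \mathcal{Q}^*} \prod_{i \ne j^*} A(q_i^*, \delta) = O(\delta^{\alpha^*}). \]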

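Consider a specific covering $\mathcal{Q}^* = \{q^* = (q_1^*, \ldots, q_k^*)\}$ where the $q_i^*$'s take values in the set $\{t\delta \mid t \in \mathbb{N},\ t \le 1/\delta\}$. Recall from part (a) that $\mathcal{Q}$ is a $(k-2)$-dimensional manifold such that two elements among the $q_i^*$ for $i = 1, \ldots, k$ can be uniquely determined by the remaining $k - 2$ elements. Except potentially $q^*_{j^*}$, there is still another element among $q_1^*, \ldots, q_{k-1}^*$ that is determined by the remaining elements. We denote that element by $q^*_{i^*}$, for some $i^* = 1, \ldots, k$, $i^* \ne j^*$.

By the definition of $A$, if $q^*_{i^*} \in (t\delta, (t+1)\delta]$ for some $t \ge 2$, then $A(q^*_{i^*}, \delta) \le 2\delta^{\alpha\beta_{i^*}} \le 2\delta^{\alpha^*}$. If $q^*_{i^*} \le 2\delta$, then $A(q^*_{i^*}, \delta) \le (3\delta)^{\alpha^*}/\alpha^*$ for $\delta < 1/3$. So in any case, $A(q^*_{i^*}, \delta) \le 2(3\delta)^{\alpha^*}/\alpha^*$ as long as $\delta < 1/3$. Hence,

\[ \sum_{q^* \in \mathcal{Q}^*} \prod_{i \ne j^*} A(q_i^*, \delta) \le \frac{2(3\delta)^{\alpha^*}}{\alpha^*} \sum_{q^* \in \mathcal{Q}^*} \prod_{i \ne i^*, j^*} A(q_i^*, \delta). \tag{31} \]

Subdivide set $\mathcal{Q}^*$ into disjoint subsets according to the corresponding $(i^*, j^*)$ pairs. There are at most $k(k-1)/2$ such subsets. Consider one such subset, say $\mathcal{Q}^*_{12}$, distinguished by $(i^*, j^*) = (1, 2)$; we have

\[ \sum_{q^* \in \mathcal{Q}^*_{12}} \prod_{i \ne 1, 2} A(q_i^*, \delta) \le \prod_{i \ne 1, 2} \sum_{q_i^*} A(q_i^*, \delta). \]

For any $i = 1, \ldots, k$, $\sum_{q_i^*} A(q_i^*, \delta) \le 2(3\delta)^{\alpha^*}/\alpha^* + 2\delta^{\alpha\beta_i} \sum_{t=2}^{1/\delta} (t-1)^{\alpha\beta_i - 1} \le 2(3\delta)^{\alpha^*}/\alpha^* + 2\delta^{\alpha\beta_i} \int_0^{1/\delta} x^{\alpha\beta_i - 1}\,dx \le 2(3\delta)^{\alpha^*}/\alpha^* + 2/\alpha^* \le 4/\alpha^*$ for $\delta < 1/3$. It follows that the RHS in Eq. (31) is bounded from above by $(3\delta)^{\alpha^*}/\alpha^* \times k(k-1)(4/\alpha^*)^{k-2}$. We conclude the proof.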

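Lemma 9. Suppose that a subset of probability measures $\mathcal{B} \subset \mathcal{P}(\Theta)$ is defined by $\mathcal{B} = \{Q \in \mathcal{P}(\Theta) \mid F(Q) < 0\}$, where $F : \mathcal{P}(\Theta) \to \mathbb{R}$ is a continuous functional (i.e., $F(Q) \to F(Q_0)$ whenever $W_r(Q, Q_0)$ tends to 0). Then,

(a) $\mathrm{bd}\,\mathcal{B} = \mathrm{bd}\,\mathcal{B}^c = \{Q \in \mathcal{P}(\Theta) \mid F(Q) = 0\}$.

(b) For any $\epsilon > 0$, $\mathcal{B}_\epsilon \setminus \mathcal{B} \subset (\mathrm{bd}\,\mathcal{B})_\epsilon$ and $(\mathcal{B}^c)_\epsilon \setminus \mathcal{B}^c \subset (\mathrm{bd}\,\mathcal{B})_\epsilon$.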

Proof. (a) is immediate from the definition and the continuity of the functional $F$. (b) is proved if we can show that if $Q \in \mathcal{B}$ and $Q' \in \mathcal{B}^c$ are such that $W_r(Q, Q') \le \epsilon$, then there exists $P \in \mathrm{bd}\,\mathcal{B}$ such that $W_r(Q, P) \le \epsilon$ and $W_r(Q', P) \le \epsilon$. Indeed, consider the collection of measures $Q_t = tQ + (1 - t)Q'$ for $t \in [0, 1]$, so that $Q_0 = Q'$ and $Q_1 = Q$. By the convexity of the Wasserstein metric, $W_r(Q_t, Q) \le tW_r(Q, Q) + (1 - t)W_r(Q', Q) \le \epsilon$. Note also that $F(Q_t)$ is a continuous function of $t$. If either $F(Q_0) = 0$ or $F(Q_1) = 0$ (that is, $Q' \in \mathrm{bd}\,\mathcal{B}$ or $Q \in \mathrm{bd}\,\mathcal{B}$, respectively), then we are done. Otherwise, we have $F(Q_0) > 0$ and $F(Q_1) < 0$. So there exists $t \in (0, 1)$ such that $F(Q_t) = 0$. It follows that $Q_t \in \mathrm{bd}\,\mathcal{B}$ and that $W_r(Q, Q_t) \le \epsilon$ and $W_r(Q', Q_t) \le \epsilon$. □
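The interpolation step in the proof of Lemma 9 can be illustrated numerically. The sketch below rests on simplifying assumptions that are not part of the lemma: all measures live on a common finite grid on the real line, $r = 1$ (so that $W_1$ is the integral of the absolute difference of the two distribution functions), and the functional is taken to be $F(Q) = \mathrm{mean}(Q) - 1/2$. The helper names `w1_on_grid`, `mixture` and `boundary_time` are hypothetical.

```python
from itertools import accumulate


def w1_on_grid(p, q, grid):
    """W_1 between two discrete measures with weights p, q on a sorted 1-D grid.

    On the real line, W_1 equals the integral of |F_p - F_q|, where F_p and
    F_q are the cumulative distribution functions.
    """
    cdf_p = list(accumulate(p))
    cdf_q = list(accumulate(q))
    return sum(abs(cdf_p[i] - cdf_q[i]) * (grid[i + 1] - grid[i])
               for i in range(len(grid) - 1))


def mixture(t, p, q):
    """Weights of Q_t = t*Q + (1 - t)*Q'."""
    return [t * a + (1.0 - t) * b for a, b in zip(p, q)]


def boundary_time(F, p, q, tol=1e-12):
    """Bisection for t in (0, 1) with F(Q_t) = 0, assuming F(Q) < 0 < F(Q')."""
    lo, hi = 0.0, 1.0  # Q_0 = Q' has F > 0, Q_1 = Q has F < 0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if F(mixture(mid, p, q)) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Illustrative choices (assumptions): a three-point grid and F = mean - 1/2.
grid = [0.0, 0.5, 1.0]
Q = [0.7, 0.2, 0.1]        # F(Q) < 0, so Q lies in B
Q_prime = [0.1, 0.2, 0.7]  # F(Q') > 0, so Q' lies in the complement of B


def F(weights):
    return sum(x * w for x, w in zip(grid, weights)) - 0.5


t = boundary_time(F, Q, Q_prime)
Q_t = mixture(t, Q, Q_prime)
print("t on the boundary:", t)
print("W1(Q, Q'):", w1_on_grid(Q, Q_prime, grid))
print("W1(Q_t, Q):", w1_on_grid(Q_t, Q, grid))
print("W1(Q_t, Q'):", w1_on_grid(Q_t, Q_prime, grid))
```

In this one-dimensional setting the distances from $Q_t$ to the two endpoints never exceed $W_1(Q, Q')$, which is the property the proof uses to place a boundary point within $\epsilon$ of both $Q$ and $Q'$.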



8.2 Proof of Lemma 5



Part (a) is an application of Theorem 2 of Wong and Shen [1995], which is restated as follows: Suppose that $\epsilon = \epsilon_n$ satisfies the following inequality:

\[ \int_{\epsilon^2/2^8}^{\sqrt{2}\epsilon} \big[\log N(u/c_3, \mathcal{F}, h)\big]^{1/2}\,du \le c_4 n^{1/2}\epsilon^2, \tag{32} \]

where $c_3$ and $c_4$ are certain universal constants (cf. Theorem 1 of Wong and Shen [1995]). Then, for some universal constants $c_1, c_2 > 0$, if $\eta_n \le c_1\epsilon^2$, the following probability bound holds under $Q_0 * f$-measure, for any $Q_0 \in \mathcal{Q}$,

KKLQo * f) > en) < 5exp(-c 2 ne 2 ). 



It remains to verify the entropy condition (32) given the rates specified in the statement of the present lemma. By Assumption A2 and Lemma 2 (see also Lemma 1 of Nguyen [2012a]) we have $h^2(Q * f, Q' * f) \le C_1 W_{2r}^{2r}(Q, Q')$. This implies that

\[ N(u/c_3, \mathcal{F}, h) \le N\big((u^2/c_3^2 C_1)^{1/2r}, \mathcal{Q}, W_{2r}\big). \]

By Lemma 4(b) of Nguyen [2012a],

\[ \log N(2\delta, \mathcal{Q}, W_{2r}) \le N(\delta, \Theta, \|\cdot\|)\, \log\big(e + e\,\mathrm{diam}(\Theta)^{2r}/\delta^{2r}\big). \]

Since $\Theta \subset \mathbb{R}^d$, $N(\delta, \Theta, \|\cdot\|) \le (\mathrm{diam}(\Theta)/\delta)^d$. So,



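\[
\begin{aligned}
&\int_{\epsilon^2/2^8}^{\sqrt{2}\epsilon} \big[\log N\big((u^2/c_3^2 C_1)^{1/2r}, \mathcal{Q}, W_{2r}\big)\big]^{1/2}\,du \\
&\quad\le \int_{\epsilon^2/2^8}^{\sqrt{2}\epsilon} \Big[ N\big(\tfrac{1}{2}(u^2/c_3^2 C_1)^{1/2r}, \Theta, \|\cdot\|\big)\, \log\big(e + e\,\mathrm{diam}(\Theta)^{2r} 2^{2r} c_3^2 C_1/u^2\big) \Big]^{1/2}\,du \\
&\quad\le \int_{\epsilon^2/2^8}^{\sqrt{2}\epsilon} (2\,\mathrm{diam}(\Theta))^{d/2}\, c_3^{d/2r}\, C_1^{d/4r}\, u^{-d/2r}\, \big[\log\big(e + e\,\mathrm{diam}(\Theta)^{2r} 2^{2r} c_3^2 C_1/u^2\big)\big]^{1/2}\,du.
\end{aligned}
\]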



For Eq. (32) to hold, it suffices to have the right side of the inequality in the above display bounded by $c_4 n^{1/2}\epsilon^2$. Indeed, this is straightforward to check for the rates given in part (a) of the lemma.

Part (b) of the lemma is proved in the same way, by invoking a tighter bound on the covering number (given by Lemma 4(a) of Nguyen [2012a]):

\[ \log N(2\delta, \mathcal{Q}_k, W_{2r}) \le k\big(\log N(\delta, \Theta, \|\cdot\|) + \log(e + e\,\mathrm{diam}(\Theta)^{2r}/\delta^{2r})\big). \tag{33} \]

Parts (c) and (d) are immediate consequences of parts (a) and (b) by invoking Theorem 2 of Nguyen [2012a].



8.3 Proof of Lemma 6

(a) Let $B_i$ be the closed ball in $\Theta$ with radius $\delta > 0$ and centered at $\theta_i$ for $i = 1, \ldots, k$. Suppose that $\delta$ is sufficiently small so that $\delta < \min_{i \ne j} \|\theta_i - \theta_j\|/2$. That is, the $B_i$'s are disjoint subsets of $\Theta$. Let $g_i = G(B_i)$ for $i = 1, \ldots, k$ and $g_0 = 1 - \sum_{i=1}^k g_i$. Since $W_1(G, G_0) \ge \delta(\beta_i - g_i)$, it follows that $\beta_i - g_i \le \epsilon/\delta$. By the condition on $\delta$, $g_i \ge \beta_i - \epsilon/\delta \ge \beta_i/2$ for any $i = 1, \ldots, k$. We also obtain $g_0 \le \epsilon/\delta$.

Suppose that $g_0 > 0$ (the case that $g_0 = 0$ is handled similarly, yielding a stronger lower bound). Let $q_i = Q(B_i)$ for $i = 1, \ldots, k$ and $q_0 = 1 - \sum_{i=1}^k q_i$. By assumption, $\mathrm{law}(q) = \mathrm{Dir}(\alpha g)$. Let $Q_0 = \sum_{i=1}^k q_i^* \delta_{\theta_i}$ and also set $q_0^* := 0$. If $\|q - q^*\|_1 = \sum_{i=0}^k |q_i - q_i^*| \le \delta^r/\mathrm{diam}(\Theta)$, then

\[ W_r^r(Q, Q_0) \le \delta^r \sum_{i=1}^k q_i \wedge q_i^* + \sum_{i=1}^k |q_i - q_i^*|\, \mathrm{diam}(\Theta) \le 2\delta^r. \]

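Define $\mathcal{E} = \{(q_1, \ldots, q_k, q_0) \in \Delta^{k}$ such that $|q_i - q_i^*| \le \delta^r/(2k\,\mathrm{diam}(\Theta))$ for $i = 0, \ldots, k-1\}$. If $q \in \mathcal{E}$ then $\|q - q^*\|_1 \le 2\sum_{i=0}^{k-1} |q_i - q_i^*| \le \delta^r/\mathrm{diam}(\Theta)$. It follows that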


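\[
\begin{aligned}
P(W_r^r(Q, Q_0) \le 2\delta^r) &\ge P(\mathcal{E}) \\
&= \frac{\Gamma(\alpha)}{\prod_{i=0}^k \Gamma(\alpha g_i)} \int_{\mathcal{E}} \prod_{i=0}^{k-1} q_i^{\alpha g_i - 1} \Big(1 - \sum_{i=0}^{k-1} q_i\Big)^{\alpha g_k - 1} dq_0 \cdots dq_{k-1} \\
&\ge \frac{\Gamma(\alpha)}{\prod_{i=0}^k \Gamma(\alpha g_i)} \prod_{i=0}^{k-1} \int_{(q_i^* - \delta^r/2k\,\mathrm{diam}(\Theta))_+}^{(q_i^* + \delta^r/2k\,\mathrm{diam}(\Theta))_{++}} q_i^{\alpha g_i - 1}\,dq_i \\
&\ge \frac{\Gamma(\alpha)}{\prod_{i=0}^k \Gamma(\alpha g_i)} \times \frac{(\delta^r/2k\,\mathrm{diam}(\Theta))^{\alpha g_0}}{\alpha g_0} \times \big(\delta^r/2k\,\mathrm{diam}(\Theta)\big)^{k-1}.
\end{aligned}
\]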



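Here, $a_+ = a \vee 0$ and $a_{++} = a \wedge 1$; the second and the third inequality in the above display are due to $q_i^{\alpha g_i - 1} \ge 1$ for all $i = 1, \ldots, k$ as $\alpha \le 1$. Finally, note that $\Gamma(\alpha g_i)(\alpha g_i) = \Gamma(\alpha g_i + 1) \le 1$ for $i = 0, \ldots, k$. So,

\[
\begin{aligned}
P(W_r^r(Q, Q_0) \le 2\delta^r) &\ge \Gamma(\alpha) \prod_{i=1}^k (\alpha g_i) \Big(\frac{\delta^r}{2k\,\mathrm{diam}(\Theta)}\Big)^{\alpha g_0 + k - 1} \\
&\ge \Gamma(\alpha)(\alpha/2)^k \Big(\frac{\delta^r}{2k\,\mathrm{diam}(\Theta)}\Big)^{\alpha + k - 1} \prod_{i=1}^k \beta_i.
\end{aligned}
\]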



Part (b) follows from part (a). Its argument is similar to the proof of Theorem 3 and is omitted.



8.4 Proof of Lemma H 



Thanks to Sethuraman [1994], a random measure whose law is $\mathcal{D}$ may be parameterized as $Q = \sum_{i=1}^\infty p_i \delta_{\theta_i}$, where the $\theta_i$'s are iid according to $G$, and the $p_i$'s are distributed according to a "stick-breaking" process: $p_1 = v_1$, $p_2 = v_2(1 - v_1)$, ..., $p_k = v_k \prod_{i=1}^{k-1} (1 - v_i)$ for any $k = 1, 2, \ldots$, where $v_1, v_2, \ldots$ are beta random variables iid according to $\mathrm{Beta}(1, \alpha)$.
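The stick-breaking construction above can be sketched in a few lines. This is a minimal illustration rather than part of the proof; the function name `stick_breaking`, the truncation at `k` sticks, the value of `alpha`, and the standard normal base measure are all assumptions made for the example. The returned `remaining` is the mass left after `k` breaks, i.e. the running product of the factors $(1 - v_i)$.

```python
import random


def stick_breaking(alpha, k, rng, base_sampler):
    """Draw the first k stick-breaking weights and atoms.

    Returns (weights, atoms, remaining), where remaining is the product of
    (1 - v_i) over i = 1..k, the mass not yet assigned after k breaks.
    """
    weights, atoms = [], []
    remaining = 1.0
    for _ in range(k):
        v = rng.betavariate(1.0, alpha)    # v_i ~ Beta(1, alpha)
        weights.append(v * remaining)      # p_i = v_i * prod_{j < i} (1 - v_j)
        remaining *= 1.0 - v
        atoms.append(base_sampler(rng))    # theta_i drawn iid from the base measure
    return weights, atoms, remaining


# Illustrative parameters (assumptions): alpha = 2, k = 20, base measure N(0, 1).
rng = random.Random(0)
weights, atoms, remaining = stick_breaking(
    alpha=2.0, k=20, rng=rng, base_sampler=lambda r: r.gauss(0.0, 1.0)
)
print("assigned mass:", sum(weights))
print("leftover mass:", remaining)
```

Up to floating-point error, the assigned mass and the leftover mass add to one, which is the identity used in the tail bound for the truncated weights.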






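It is simple to check that $1 - \sum_{i=1}^k p_i = \prod_{i=1}^k (1 - v_i)$. By Markov's inequality, for any $\epsilon > 0$, under $\mathcal{D}$ measure

\[
\begin{aligned}
P\Big(1 - \sum_{i=1}^k p_i \ge \epsilon\Big) &= P\Big(\prod_{i=1}^k (1 - v_i) \ge \epsilon\Big) \le \inf_{t > 0} \prod_{i=1}^k E(1 - v_i)^t/\epsilon^t = \inf_{t > 0}\, [\alpha/(\alpha + t)]^k/\epsilon^t \\
&= \exp[-\alpha\log(1/\epsilon) - k\log k + k\log(e\alpha) + k\log\log(1/\epsilon)] =: A(\epsilon, k).
\end{aligned}
\]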
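The closed form $A(\epsilon, k)$ can be checked against a direct minimization of the Markov bound. The sketch below is illustrative only: the function names and the parameter values are assumptions, and the minimizer $t^* = k/\log(1/\epsilon) - \alpha$ (obtained by setting the derivative of the log-objective to zero) is positive only when $k > \alpha\log(1/\epsilon)$, which the chosen values satisfy.

```python
import math


def markov_objective(t, alpha, k, eps):
    """[alpha / (alpha + t)]^k / eps^t: the Markov bound at exponent t > 0.

    Uses E(1 - v)^t = alpha / (alpha + t) for v ~ Beta(1, alpha).
    """
    return (alpha / (alpha + t)) ** k / eps ** t


def tail_bound(eps, k, alpha):
    """Closed form A(eps, k) of the infimum over t > 0."""
    log_inv = math.log(1.0 / eps)
    return math.exp(
        -alpha * log_inv
        - k * math.log(k)
        + k * math.log(math.e * alpha)
        + k * math.log(log_inv)
    )


# Illustrative parameters (assumptions): alpha = 1, k = 20, eps = 0.1.
alpha, k, eps = 1.0, 20, 0.1
t_star = k / math.log(1.0 / eps) - alpha  # stationary point of the log-objective
grid_min = min(markov_objective(0.01 * j, alpha, k, eps) for j in range(1, 5001))

print("t*:", t_star)
print("A(eps, k):", tail_bound(eps, k, alpha))
print("objective at t*:", markov_objective(t_star, alpha, k, eps))
print("grid minimum:", grid_min)
```

The grid search is only a sanity check on the algebra; it does not replace the analytic minimization.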

Define $\mathcal{B}_k$ to be the subset of discrete measures $Q$ that lie in the support of the Dirichlet process, such that under the stick-breaking representation described above, $1 - \sum_{i=1}^k p_i \le (\delta/\mathrm{diam}(\Theta))^r$. Hence, $\mathcal{D}(\mathcal{P}(\Theta) \setminus \mathcal{B}_k) \le A((\delta/\mathrm{diam}(\Theta))^r, k)$. This concludes part (c).

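For any $Q = \sum_{i=1}^\infty p_i \delta_{\theta_i}$ such that $1 - \sum_{i=1}^k p_i \le \epsilon$, by choosing $Q' = \sum_{i=1}^k p_i \delta_{\theta_i} + (1 - \sum_{i=1}^k p_i)\delta_{\theta_k} \in \mathcal{Q}_k(\Theta)$ we have $W_r(Q, Q') \le \epsilon^{1/r}\,\mathrm{diam}(\Theta)$. Thus, $\inf_{Q' \in \mathcal{Q}_k} W_r(Q, Q') \le \epsilon^{1/r}\,\mathrm{diam}(\Theta) = \delta$, concluding (a). For (b) we obtain that $\log N(2\delta, \mathcal{B}_k, W_r) \le \log N(\delta, \mathcal{Q}_k, W_r)$. Combine this with Lemma 4(b) of Nguyen [2012a] (cf. Eq. (33)) to conclude.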



8.5 Proof of Lemma M 

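By Lemma QIb), $W_r^r(\mathcal{D}_{\alpha G}, \mathcal{D}_{\alpha G_0}) = W_r^r(G, G_0) \le \epsilon\,\mathrm{diam}(\Theta)^{r-1}$. This implies that there exists a coupling $\mathcal{K} \in \mathcal{T}(\mathcal{D}_{\alpha G}, \mathcal{D}_{\alpha G_0})$ such that $\int W_r^r(Q, Q')\,d\mathcal{K} \le \epsilon\,\mathrm{diam}(\Theta)^{r-1}$. Let $\mathcal{Q}_0 \subset \mathcal{P}(\Theta)$ be the support of $\mathcal{D}_{\alpha G_0}$; this consists of probability measures that share the same $k$ support points with $G_0$. Write $W_r(Q, \mathcal{Q}_0) := \inf_{P \in \mathcal{Q}_0} W_r(Q, P)$. Then $\int W_r^r(Q, \mathcal{Q}_0)\,d\mathcal{D} = \int W_r^r(Q, \mathcal{Q}_0)\,d\mathcal{K} \le \int W_r^r(Q, Q')\,d\mathcal{K} \le \epsilon\,\mathrm{diam}(\Theta)^{r-1}$. Let $\mathcal{B} = \{Q : W_r(Q, \mathcal{Q}_0) \le \delta\}$. By Markov's inequality, $\mathcal{D}(Q : W_r^r(Q, \mathcal{Q}_0) \ge \delta^r) \le \epsilon\,\mathrm{diam}(\Theta)^{r-1}/\delta^r$, yielding (b). Moreover, $\log N(2\delta, \mathcal{B}, W_r) \le \log N(\delta, \mathcal{Q}_0, W_r)$. Combining this with Lemma 4(b) of Nguyen [2012a] immediately yields (a).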